1. Data is information that can be processed by a computer. The information is in digital form, and a program can read and analyze it.

2. Data is accessible when you have the rights to access it and you know the process to obtain and use it.

3. An API (application programming interface) is a document that specifies what queries are available, their format, and the format of the query responses.

4. Types of data: tabular data, time series data, networked data, geospatial data, text data, multimedia data

5. Image “Segmentation”: Finding Lines in an Image

6. Data silos: hard to get data out, hard to integrate data across silos

7. Raw data is newly collected data before any pre-processing/cleaning. It typically has errors and missing values, and needs further processing before analysis.

8. Open data: data that is open to anyone; some open datasets have licenses that restrict usage.

9. Big data 3 Vs: volume, variety, velocity (very fast collection rate).

10. A compiler is a computer program that transforms source code written in a programming language (the source language) into another computer language (the target language), often in a binary form known as object code. It bridges human-written code and machine-executable code.

11. An algorithm is a mechanical procedure that describes how to carry out a computation on some data. The logic to process data.

12. Turing machine: A Turing machine is a hypothetical machine thought of by the mathematician Alan Turing in 1936. Despite its simplicity, the machine can simulate ANY computer algorithm, no matter how complicated it is!

13. Programming languages are said to be Turing-complete because they can be used to implement a Turing machine. To do so, they must include constructs such as iteration and conditional statements. All general-purpose programming languages are Turing-complete and therefore computationally equivalent (equally powerful).

14. A workflow is a composition of functions.

15. Computational workflow: a workflow is represented as a graph of connected nodes. Nodes represent programs and data (alternating between the two);
links represent how data flows from program to program (output to input).
Computational workflows are compositions of programs.

16. A single processing step or component of a workflow can basically be defined by three parameters (a small sketch follows this list):
input description: the information, material, and energy required to complete the step
transformation rules or algorithms, which may be carried out by associated human roles or machines, or a combination of both
output description: the information, material, and energy produced by the step and provided as input to downstream steps.
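
A minimal Python sketch of these ideas (the step names and data are made up, not from the notes): each step is a function with an input and an output, and the workflow composes the steps so that one step's output is the next step's input.

    # Each step is a function: it takes an input description and produces an output.
    def clean(records):
        """Transformation step: drop records with missing values."""
        return [r for r in records if None not in r.values()]

    def summarize(records):
        """Transformation step: compute the average of the 'value' field."""
        return sum(r["value"] for r in records) / len(records)

    def workflow(records):
        """The workflow composes the steps: the output of one is the input of the next."""
        return summarize(clean(records))

    data = [{"value": 3}, {"value": None}, {"value": 5}]
    print(workflow(data))  # 4.0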

17. Each time the workflow is executed, the system records the provenance of the results: what workflow was used, what its components were, what the input data was, and what values were assigned to the parameters.

18. Repeatability: the same lab can re-run a data analysis method and get the same results.
Replication: another lab can re-run a data analysis method and get the same results.
Reproducibility: another lab can run a data analysis method with different data.
Reuse: another lab can run a data analysis method (or parts of it) for a different experiment.

19. Notebooks: record data, software, results, notes, etc. They record what code was run when generating a result, and the code can be re-run with new data.

20. Four data analysis tasks: clustering, classification, pattern detection, simulation.
Classification: assign a category to a new instance
Clustering: form clusters from a set of instances
Pattern detection: identify regularities in temporal or spatial data
Simulation: define mathematical formulas that can generate the actually observed data

21. Classification: given a set of classes, each with instances, output a model that assigns a class to a new instance.
Instances have features/attributes with values.
A class is also called a label; the input instances are called labeled instances.

22. Decision tree: nodes are attribute-based decisions, branches are alternative values of the attributes, and each leaf is a class (a small sketch follows).
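
A minimal hand-written sketch (the attributes and classes are made up): internal nodes test attributes, branches follow attribute values, and each leaf returns a class.

    def classify(instance):
        """A tiny decision tree: internal nodes test attributes, leaves are classes."""
        if instance["outlook"] == "sunny":          # node: test the 'outlook' attribute
            if instance["humidity"] == "high":      # branch: outlook == sunny -> test 'humidity'
                return "dont_play"                  # leaf: class
            return "play"                           # leaf: class
        return "play"                               # branch: outlook != sunny -> leaf

    print(classify({"outlook": "sunny", "humidity": "high"}))  # dont_play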

23. Training set: training instances used as input to train the classification model. Test set: test instances given to the classifier to be assigned classes.

24. Contamination: when the training set and the test set overlap -- this is wrong and invalidates the evaluation.

25. Requirement for classification: classes are disjoint -- an instance has only one class.

26. Classification modeler: a mathematical/algorithmic approach that infers generalizable features from training instances and hypothesizes classes for new instances. It generates a model.

27. Ensemble learning: an ensemble uses several algorithms for the same task and combines their results. A combination function joins the results -- e.g., majority vote or weighted voting (a small sketch follows).
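
A minimal sketch (the predictions are made up) of a majority-vote combination function over the outputs of several classifiers:

    from collections import Counter

    def majority_vote(predictions):
        """Combination function: return the class predicted by most of the classifiers."""
        return Counter(predictions).most_common(1)[0][0]

    # Three hypothetical classifiers voted on the same instance:
    print(majority_vote(["spam", "ham", "spam"]))  # spam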

28. Evaluation:
accuracy = correctly labeled instances / test instances (a small metrics sketch follows this list)
n-fold cross-validation: divide the instances into n folds of equal size. Run the classifier n times, each time holding out one fold as the test set and training on the rest. Each run gives an accuracy score.
Confusion matrix: positive and negative (as classified); true and false (compared to the actual results)
Precision: true positives out of all instances labeled positive
Recall: true positives out of all actual positives (true positives + false negatives)
Explainability: whether the model's predictions can be explained
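
A minimal sketch (the counts are made up) computing accuracy, precision, and recall from confusion-matrix counts:

    def metrics(tp, fp, fn, tn):
        """Compute accuracy, precision, and recall from confusion-matrix counts."""
        accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct labels / all test instances
        precision = tp / (tp + fp)                   # true positives / all labeled positive
        recall = tp / (tp + fn)                      # true positives / all actual positives
        return accuracy, precision, recall

    print(metrics(tp=40, fp=10, fn=5, tn=45))  # (0.85, 0.8, 0.888...)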

29. The complexity of classification: large numbers of features (high dimensionality); sparse data -- features that appear very few times.

30. If a model overfits -- it fits the training set very accurately -- it may not classify the test set very well.

31. Induction: inferring general rules from examples seen in the past. Classifiers use induction -- they generate general rules about the classes.

32. Latent variables: features that are not directly observable.

33. Supervised learning: the training set is annotated (labeled). Unsupervised learning: the data is not annotated. Semi-supervised learning: only part of the data is annotated.

34. A pattern language describes the patterns.

35. Challenge: concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.

36. Pattern detection: input a data set and a set of patterns; output matches of the patterns against the data
Pattern learning: input data annotated with patterns; output patterns with their frequency of occurrence
Pattern discovery: input data; output a set of patterns that appear with some frequency

37. Clustering: given a set of data with features (feature vectors) and a target number of clusters k; output the best assignment of data points to clusters.

38. K-means clustering: start with k random data points as cluster centers; assign each data point to the nearest cluster center; for each cluster, compute the centroid and make it the new cluster center; iterate until the data points' distances to their cluster centers are minimized (a small sketch follows).
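
A minimal numpy sketch of that loop (the data and k are made up; in practice a library such as scikit-learn would be used):

    import numpy as np

    def kmeans(points, k, iterations=10, seed=0):
        """Very small k-means: assign points to the nearest center, recompute centroids, repeat."""
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), k, replace=False)]  # k random data points
        for _ in range(iterations):
            # assign each point to the nearest cluster center
            labels = np.argmin(np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
            # recompute each cluster center as the centroid of its points
            centers = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        return labels, centers

    data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    print(kmeans(data, k=2))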

39. Simulation: make predictions over observed data

40. To establish causality: randomized controlled experiments.

41. Ways to express causality:
Probabilistic graphical model: a graph that captures dependencies among variables. Nodes are variables, links are dependencies.
Bayesian network: directed edges show the direction of influence; no cycles allowed. It is annotated with probability distribution functions that specify the probabilistic relationships.
Bayesian inference: used to reason over a Bayesian network (a small inference sketch follows this list).
Markov network: edges have no direction. It gives a potential function for each clique (at least 2 variables) of interconnected nodes.
Causal model: a Bayesian network where all relationships are causal.
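
A minimal sketch of Bayesian inference by enumeration over a tiny Bayesian network (the classic rain/sprinkler/wet-grass example; the probabilities are made up, not from the notes):

    from itertools import product

    # Network: Rain -> Sprinkler, Rain -> WetGrass, Sprinkler -> WetGrass
    P_rain = {True: 0.2, False: 0.8}
    P_sprinkler = {True: {True: 0.01, False: 0.99}, False: {True: 0.4, False: 0.6}}  # P(S | R)
    P_wet = {  # P(W=True | S, R)
        (True, True): 0.99, (True, False): 0.9,
        (False, True): 0.8, (False, False): 0.0,
    }

    def joint(r, s, w):
        """Joint probability P(R=r, S=s, W=w) from the network's local distributions."""
        pw = P_wet[(s, r)] if w else 1 - P_wet[(s, r)]
        return P_rain[r] * P_sprinkler[r][s] * pw

    # Bayesian inference by enumeration: P(Rain=True | WetGrass=True)
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    print(num / den)  # ~0.36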

42. How to learn causal models: parameter learning and structure learning. The former learns the probabilities of the model; the latter learns the structure of the model.

43. Pre-processing steps (a small sketch follows this list):
reformatting - change the format, e.g., jpg to pdf
data conversion - change the unit of measurement, e.g., from Celsius to Fahrenheit
cleaning - cleaning up errors
imputation - filling in missing values
integration - mapping variables across data sets, merging tables
feature generation - new variables based on old variables
feature construction - combining variables into new ones
feature selection - subsetting the features with certain criteria
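
A minimal pandas sketch (the column names and values are made up) of a few of these steps -- data conversion, imputation, and feature generation:

    import pandas as pd

    df = pd.DataFrame({"temp_c": [20.0, None, 25.0], "price": [10, 12, 11]})

    # data conversion: Celsius to Fahrenheit
    df["temp_f"] = df["temp_c"] * 9 / 5 + 32

    # imputation: fill missing temperatures with the column mean
    df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

    # feature generation: a new variable based on old variables
    df["price_per_degree"] = df["price"] / df["temp_c"]

    print(df)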

44. Text from the web: first extract the text from the markup language; use special software for screen scraping to get what is shown on the screen.

45. Text pre-processing (a small sketch follows this list):
OCR - optical character recognition
stemming - reduce words to the same root
parsing - extract the grammatical structure from a sentence
entity recognition - identify named entities (e.g., people, places, organizations) in text
entity resolution and record linkage - the same entity under different names
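
A minimal sketch of tokenization plus stemming (the suffix rules are made up and far cruder than a real stemmer):

    import re

    def tokenize(text):
        """Split text into lowercase word tokens."""
        return re.findall(r"[a-z]+", text.lower())

    def stem(word):
        """Naive stemming: strip a few common suffixes to approximate the word root."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    print([stem(w) for w in tokenize("The linked records were linking names")])
    # ['the', 'link', 'record', 'were', 'link', 'name']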

46. The three pillars of provenance: Resources, attributions, processes.

47. Definition of provenance: a record that describes the entities and processes involved in creating, delivering, or otherwise influencing that data.
Benefits of provenance: enabling trust, assessing authenticity, allowing reproducibility

48. Representation of provenance: Dublin Core (a schema of provenance terms to be filled in); relations between entities, activities, and agents (a small recording sketch follows):
Entity wasDerivedFrom Entity
Entity wasGeneratedBy Activity
Entity wasAttributedTo Agent
Agent actedOnBehalfOf Agent
Activity wasAssociatedWith Agent
Activity used Entity

Entities are things.
Activities are how entities come into existence and how their attributes change to become new entities, often making use of previously existing entities to achieve this.
Activities generate new entities. Activities make use of entities.
An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place.
An entity wasDerivedFrom another entity. An entity wasRevisionOf another entity.
hadPlan links an agent's association with an activity to some instructions (a plan).
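
A minimal sketch (the entity, activity, and agent names are made up) of recording provenance as subject-relation-object statements and tracing them:

    # Provenance statements: (subject, relation, object)
    provenance = [
        ("results.csv", "wasGeneratedBy", "run-42"),
        ("run-42", "used", "raw_data.csv"),
        ("run-42", "wasAssociatedWith", "alice"),
        ("results.csv", "wasAttributedTo", "alice"),
        ("results.csv", "wasDerivedFrom", "raw_data.csv"),
    ]

    def trace(name):
        """Return every statement that mentions the given entity, activity, or agent."""
        return [s for s in provenance if name in (s[0], s[2])]

    print(trace("results.csv"))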

49. Collection of time series data: sampling rate - frequency of collection; coverage - granularity in space; adaptive sampling - changing the sampling rate under certain conditions; streaming data - collection is continuous; real-time processing vs. batch processing.

50. Pre-processing: rescaling - changing to another granularity.
Decomposition: separating the three elements of time series data -- trend, seasonal component, remainder (a small sketch follows).
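
A minimal numpy sketch (the series is made up) that extracts a moving-average trend and the remainder; the seasonal component, which a full decomposition would also estimate, is left out for brevity:

    import numpy as np

    def decompose(series, window=4):
        """Split a series into a moving-average trend and the remainder."""
        kernel = np.ones(window) / window
        trend = np.convolve(series, kernel, mode="same")  # smooth to get the trend (edges are rough)
        remainder = series - trend                        # what the trend does not explain
        return trend, remainder

    series = np.array([1.0, 2.0, 1.5, 2.5, 2.0, 3.0, 2.5, 3.5])
    trend, remainder = decompose(series)
    print(trend)
    print(remainder)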

51. Alert system - track a variable; when it reaches a condition, raise an alarm (a small sketch follows this list)
Event detection - input a pattern with variables and range changes; output a match of the event pattern against the time series data
Event trigger - input an event pattern; output an earlier event that has a causal relationship with the pattern
Causality detection - input a time series; output events that may be causally related
Granger causality - a time series X influences a time series Y if past values of X predict Y values
Discovery of unexpected events - input a time series; output an unusual pattern
Pattern mining - given a time series, find the changes of variables over time, the patterns, and correlations
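
A minimal sketch (the threshold and readings are made up) of the alert-system idea: track a variable and raise an alarm whenever the condition is reached:

    def alerts(readings, threshold=30.0):
        """Yield an alarm for every reading that crosses the threshold."""
        for t, value in enumerate(readings):
            if value > threshold:
                yield f"ALARM at step {t}: value {value} exceeds {threshold}"

    for alarm in alerts([22.0, 28.5, 31.2, 29.0, 35.7]):
        print(alarm)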

52. Image processing (a smoothing sketch follows this list):
rotation - rotating the image
translation - moving in space
color filters - apply a filtering color (e.g., black & white)
inpainting - reconstructing damaged parts of images
smoothing - averaging neighboring pixels to reduce noise
brightness and contrast
edge detection - detects sharp changes in brightness; finds objects but can be overly sensitive
segmentation - find objects with similar pixels; divide the image into segments or areas
object recognition - given an object, find all its appearances in the image
object tracking - recognition + tracking
activity recognition - recognition + tracking + a known pattern
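
A minimal numpy sketch (the image is made up) of smoothing: each pixel is replaced by the average of its k x k neighborhood:

    import numpy as np

    def box_blur(img, k=3):
        """Smooth a 2-D grayscale image by averaging each pixel's k x k neighborhood."""
        pad = k // 2
        padded = np.pad(img, pad, mode="edge")
        out = np.zeros(img.shape, dtype=float)
        for dy in range(k):
            for dx in range(k):
                out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out / (k * k)

    noisy = np.array([[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 9]], dtype=float)
    print(box_blur(noisy))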

53. Three elements of the geographical feature model: point, line, area.

54. Algorithmic complexity: linear complexity - execution time grows linearly with input data size
polynomial complexity: O(n^k), where n is the data size -- often from nested iterations
exponential complexity: O(k^n) -- often from going over the data again and again

55. Parallel processing in three steps: split, process, join (a small sketch follows).
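
A minimal sketch (the work function and chunk size are made up) of split/process/join using Python's multiprocessing: the data is split into chunks, each chunk is processed by a worker, and the partial results are joined:

    from multiprocessing import Pool

    def process(chunk):
        """Work done on one chunk: sum its squares."""
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(100))
        chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # split
        with Pool(4) as pool:
            partial = pool.map(process, chunks)                     # process in parallel
        print(sum(partial))                                         # join: 328350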

56. Message passing may be required in algorithms when steps need to exchange information; such steps are not parallelizable.

57. Speedup: s = time_sequential / time_parallel

58. Critical path - a series of consecutive steps that are interdependent and thus not parallelizable.

59. Amdahl's law - maximum speedup = 1 / (1 - p), where p is the parallelizable fraction (a small example follows).
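
A small worked example of the formula (the fractions are made up): the serial part bounds the speedup no matter how many processors are used.

    def max_speedup(p):
        """Amdahl's law upper bound: 1 / (1 - p), with p the parallelizable fraction."""
        return 1 / (1 - p)

    print(max_speedup(0.90))  # 10.0  -- 10% serial work caps the speedup at 10x
    print(max_speedup(0.99))  # 100.0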

60. Multi-core computing: shared memory or distributed memory; mixed memory structure

61. GPU - graphics processing unit -- does simple calculations and displays images; cheap.

62. FLOPS, floating-point operations per second, is a measure of computer performance.

63. Distributed computing -- a parallel computing paradigm where individual processors are orchestrated over a network.
Web services -- third parties offer computation services that are orchestrated over a network.
Grid computing -- processors are orchestrated through middleware (a control center).
Cluster computing -- computers of very similar nature are orchestrated through a head node.

64. Virtual machines are frozen versions of the software on a machine that is needed to run an application.

65. Parallel programming languages contain specific instructions to use multiple processors -- MapReduce and Hadoop provide such a programming model and an ecosystem to use it (a small sketch follows).
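
A minimal sketch of the MapReduce idea (run sequentially here, not on a cluster) using the classic word-count example: a map step emits (word, 1) pairs and a reduce step sums the counts per word.

    from collections import defaultdict

    def map_step(line):
        """Map: emit a (word, 1) pair for every word in the line."""
        return [(word, 1) for word in line.split()]

    def reduce_step(pairs):
        """Reduce: sum the counts for each word."""
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["big data big compute", "big data"]
    pairs = [pair for line in lines for pair in map_step(line)]  # map over all input splits
    print(reduce_step(pairs))  # {'big': 3, 'data': 2, 'compute': 1}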

66. Metadata types: descriptive, data characteristics, provenance metadata

67. benefits of metadata: reuse, authenticity, queries on data repositories, explain data analysis, automated data integration

68. Vocabulary - terms to describe metadata. A standard is a vocabulary agreed upon by the community.
Domain-specific or domain-independent.

69. Knowledge is a set of beliefs held by an agent that determines its behavior.
Knowledge representation - the AI field that develops machine languages to represent knowledge
Meta-knowledge - things that can be inferred from knowledge

70. Descriptive knowledge:
classes - types of objects
instances of classes
property type
property value
constraints

71. Symbols are labels for entities.
Knowledge base - a set of beliefs expressed in a knowledge representation language and used by a system to generate behaviors
Knowledge representation language - a notation for how to use symbols to represent beliefs + algorithms for how to use the notation to do reasoning
Knowledge system - contains a knowledge base used to generate behavior; behaviors change with the beliefs, so if the system exhibits wrong behavior, the beliefs can be changed to fix it. It can also reason over the beliefs to make inferences and explain its behaviors.

72. Reasoning uses symbols and logic rules for inferences

73. Challenges for knowledge bases - incomplete, inaccurate, inconsistent
Challenges for logic systems - undecidable, or computationally complex (exponential complexity)





     
 