Notes
Machine Learning is getting machines to quickly make sense of large sets of input data
strong intellectual curiosity
Conduct analysis using statistical modeling and machine learning techniques; requires the ability to manipulate, transform, and summarize data.
Steps
1. Have a data dictionary. This means an explanation of each variable: what it stands for, or its definition.
2. Obtain a schema for the dataset. That commonly means getting the type for each variable.
3. Get a description of the dataset. With several Python or R packages you can get some descriptive statistics for each variable (mean, median, std, etc.), and that will help you see how the data is distributed.
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
4. Exploratory data analysis. Play with your data, plot histograms, scatter plots between variables, move the data around, find if there are inconsistencies or something weird. Document each step you do here.
5. Clean the data. The previous step should have given you an idea of whether the data is clean. If you find that it isn't, clean it.
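The schema, description, and cleaning steps above can be sketched with pandas; the tiny DataFrame and its columns here are made up, standing in for a real dataset:

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for a real one (hypothetical columns).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40000, 52000, 61000, np.nan, 48000],
})

# Step 2: schema -- the type of each variable.
schema = df.dtypes

# Step 3: descriptive statistics (mean, std, quartiles, etc.).
summary = df.describe()

# Step 5: clean -- here, fill missing values with each column's median.
cleaned = df.fillna(df.median())
```

`df.describe()` also reveals distribution oddities (step 4): a max far above the 75th percentile, for instance, hints at outliers worth plotting.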
PYTHON
NumPy
Arrays and multidimensional arrays; deals primarily with numerical data manipulation
Pandas
Library built on NumPy that provides data structures like DataFrame and Series, with indexing and slicing, MultiIndex, and handling of NaN values
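A minimal sketch of the pandas features named above (Series, DataFrame, label/position indexing and slicing, NaN handling, MultiIndex); all data here is made up:

```python
import numpy as np
import pandas as pd

# A Series and a DataFrame built on top of NumPy arrays.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("wxyz"))

# Label-based and position-based indexing/slicing.
row = df.loc[1, "x"]        # label-based lookup
block = df.iloc[0:2, 1:3]   # position-based 2x2 slice

# NaN handling.
s2 = pd.Series([1.0, np.nan, 3.0])
filled = s2.fillna(0.0)

# MultiIndex: hierarchical row labels.
mi = pd.Series([1, 2, 3, 4],
               index=pd.MultiIndex.from_product([["g1", "g2"], ["a", "b"]]))
```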
Scikit-Learn
Machine learning in python
Seaborn
Seaborn is a Python visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive statistical graphics, with themes, color palettes, and beautiful plots
Matplotlib
Python's data visualization and plotting library
SQL
Things such as the number of disk I/Os required to evaluate the plan, the plan's CPU cost, the overall response time observable by the database client, and the total execution time are essential.
If you have correlated subqueries that use EXISTS, try to select a constant in that subquery's SELECT statement instead of the value of an actual column. This is especially handy when you're only checking existence
When you use the LIKE operator in a query, a pattern that starts with % or _ prevents the database from using an index
When you use the OR operator in your query, it’s likely that you’re not using an index.
Order of table in joins
Redundant Conditions on Joins
The HAVING clause restricts the groups of returned rows (produced by GROUP BY) to only those that meet certain conditions. However, if you use this clause in your query, the index is not used
Apache Spark
Spark is a successor to Hadoop MapReduce and handles latency much better: its programming model keeps intermediate results in memory rather than redistributing intermediate data over the network on failure. Can be used from Python via pyspark
Java
Tableau
Pentaho DI (Spoon PDI)
Gradient descent
Technique to find the minimum of a cost function, such as the mean squared error function
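A minimal gradient-descent sketch minimizing mean squared error for a 1-D linear fit; the data is synthetic (y = 2x + 1 exactly), so the true minimum is known to be w = 2, b = 1:

```python
import numpy as np

# Synthetic data: y = 2x + 1, so the MSE minimum is at w=2, b=1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0
lr = 0.05  # learning rate (step size), a hyperparameter

for _ in range(2000):
    err = (w * x + b) - y
    # Partial derivatives of MSE = mean((w*x + b - y)^2) w.r.t. w and b.
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    # Step downhill, against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b
```

If `lr` is too large the iterates diverge instead of converging; too small and convergence is needlessly slow.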
Multivariate Analysis
Correlation matrix
T-tests
A t-test is a type of hypothesis test used when a claim is made about a population whose parameters are unknown; we estimate them from the sample and check whether the claim is statistically significant at a given confidence level, using the t distribution with the appropriate degrees of freedom (df)
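A one-sample t-test sketch with scipy; the sample is synthetic and the claimed population mean of 5 is an arbitrary example:

```python
import numpy as np
from scipy import stats

# Claim (null hypothesis): the population mean is 5.
# The population variance is unknown, so it is estimated from the
# sample -- hence the t distribution with n-1 degrees of freedom.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.0, size=30)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# At the 95% confidence level, reject the claim only if p < 0.05.
# (Here the claim happens to be true, so p will usually be large.)
reject = p_value < 0.05
```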
Chi Squared tests
The chi-squared test is a test of variation: we check whether the observed difference is significant enough to reject a hypothesis, using the chi-squared distribution with the appropriate degrees of freedom (df)
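A chi-squared test of independence with scipy on a made-up 2x2 contingency table (e.g. treatment vs. outcome counts):

```python
import numpy as np
from scipy import stats

# Observed counts (hypothetical numbers).
observed = np.array([[30, 10],
                     [20, 20]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# dof = (rows - 1) * (cols - 1) = 1 for a 2x2 table.
# Reject independence at the 95% level if p < 0.05.
```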
Clustering
Unsupervised learning, in which there is no labelled data; K-means clustering is a common method
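A K-means sketch with scikit-learn on two synthetic, well-separated blobs; note that no labels are passed to the fit:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points (synthetic, unlabelled).
rng = np.random.default_rng(42)
blob1 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob1, blob2])

# k=2 clusters; K-means sees only X, never any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Choosing k is itself a modeling decision; the elbow method on `km.inertia_` over several k values is one common heuristic.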
Segmentation
Cross Validation/Precision Recall/ROC
Computed from the classifier's score for each test output
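A sketch of cross-validation and score-based metrics with scikit-learn on synthetic data; logistic regression is just a convenient stand-in classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical binary classification data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Cross-validation: accuracy over 5 folds of the training set.
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)

# ROC AUC uses the classifier's score for each test example,
# not just the hard 0/1 prediction.
scores = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)

# Precision and recall use the hard predictions.
pred = clf.predict(X_te)
prec = precision_score(y_te, pred)
rec = recall_score(y_te, pred)
```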
Linear Regression
Logistic Regression
The dependent variable is binary, i.e. 0 or 1. Also, there should be no extreme outliers (Z scores outside -3.29 to 3.29), and no strong correlation among the independent variables
Methods:
Residual plots
Cross Validation
Precision Recall Curves (Only for discrete Dependent Variable)
Confusion Matrix (Only for discrete Dependent Variable)
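A small logistic-regression sketch with a confusion matrix, on a made-up one-feature dataset where the label is 1 when the feature is large:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Tiny made-up dataset; the dependent variable is binary (0 or 1).
X = np.array([[0.1], [0.4], [0.9], [1.5], [2.0], [2.6], [3.1], [3.8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

# Confusion matrix: rows = actual class, columns = predicted class.
cm = confusion_matrix(y, pred)
```

The diagonal of `cm` counts correct predictions; off-diagonal cells count false positives and false negatives.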
Decision Trees
Max_leaf_nodes: maximum number of leaf nodes in the tree
Max_depth: maximum depth, i.e. number of levels of splits
Min_samples_leaf: minimum number of samples each leaf must keep for a split to be allowed
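These hyperparameters map directly onto scikit-learn's DecisionTreeClassifier; iris is used here only as a convenient built-in dataset, and the values are arbitrary illustrations:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Small values to keep the tree deliberately shallow.
tree = DecisionTreeClassifier(
    max_leaf_nodes=4,     # at most 4 leaves in the whole tree
    max_depth=3,          # at most 3 levels of splits
    min_samples_leaf=5,   # each leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

n_leaves = tree.get_n_leaves()
depth = tree.get_depth()
```

Tightening these limits trades training accuracy for simpler, less overfit trees.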
SVM
SVC Classifiers
Maps data to -1 or +1 (similar to logistic regression, where it is 0 or 1)
Uses the sign of the hypothesis function's output
The maximum-margin linear classifier is the SVC (how far the classifier's margin can be widened before hitting a data point)
C determines regularization; a larger C means less regularization
Alpha is another hyperparameter that controls the weights assigned to the L1 and L2 penalties
Kernelized SVM is for data that is not linearly separable
Gamma comes into the picture here (kernel width parameter)
Higher gamma means more overfitting and narrower decision regions
The combination of C and gamma determines regularization
Not very useful for larger datasets (e.g. >50,000 samples)
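A kernelized SVC sketch on scikit-learn's make_circles data, which is not linearly separable; the C and gamma values are arbitrary illustrations:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes,
# so a kernelized SVC (RBF kernel) is needed.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# C controls regularization (larger C = less regularization);
# gamma is the RBF kernel width (larger gamma = tighter regions,
# more risk of overfitting).
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
acc = clf.score(X, y)  # training accuracy
```

In practice C and gamma are tuned jointly, e.g. with a grid search over a cross-validated range.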
Principal Component Analysis
Principal Component Analysis is used to remove the redundant features from the datasets without losing much information.
Method to transform an n-dimensional space into a lower-dimensional space by removing unwanted features. The eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data; those are the ones that can be dropped.
Thus:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X_std)  # X_std: the standardized feature matrix
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlim(0, 7)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
The above plot shows that the first 6 components explain almost 90% of the variance. Therefore, the components beyond the 6th can be dropped.
Ensemble Modelling/Random Forest
A/B testing
A/B testing is used everywhere. Marketing, retail, newsfeeds, online advertising, and more.
A/B testing is all about comparing things.
If you’re a data scientist, and you want to tell the rest of the company, “logo A is better than logo B”, well you can’t just say that without proving it using numbers and statistics.
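One way to back "logo A is better than logo B" with numbers is a chi-squared test on conversion counts; the figures below are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical results: 1000 visitors saw each logo.
#                  converted  not converted
counts = np.array([[100,      900],    # logo A
                   [160,      840]])   # logo B

chi2, p, dof, expected = chi2_contingency(counts)

# If p < 0.05, the difference in conversion rate is statistically
# significant at the 95% confidence level.
significant = p < 0.05
```

With these counts (10% vs. 16% conversion), the difference is large enough that the test rejects "no difference between the logos".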
Jupyter Notebook
Oracle 10g
Eclipse
WebLogic 9.x
Fetched data from various websites using locu.com in Python