Notes
Machine Learning is getting machines to quickly make sense of large sets of input data
strong intellectual curiosity
Conduct analysis using statistical modeling and machine learning techniques; requires the ability to manipulate, transform, and summarize data.
Steps
1. Have a data dictionary. This means an explanation of each variable: what it stands for, or its definition.
2. Obtain a schema for the dataset. That commonly means getting the type for each variable.
3. Get a description of the dataset. With several Python or R packages you can get some descriptive statistics for each variable (mean, median, std, etc.), and that will help you see how the data is distributed.
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
4. Exploratory data analysis. Play with your data, plot histograms, scatter plots between variables, move the data around, find if there are inconsistencies or something weird. Document each step you do here.
5. Clean the data. The previous step should have given you an idea of whether the data is clean. If you find that it isn't, clean it.
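The schema, description, and cleaning steps above can be sketched with pandas; the tiny DataFrame and its columns here are made up, standing in for a real dataset:

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for a real one (hypothetical columns).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40000, 52000, 61000, np.nan, 48000],
})

# Step 2: schema -- the type of each variable.
schema = df.dtypes

# Step 3: descriptive statistics (mean, std, quartiles, etc.).
summary = df.describe()

# Step 5: clean -- here, fill missing values with each column's median.
cleaned = df.fillna(df.median())
```

`df.describe()` also reveals distribution oddities (step 4): a max far above the 75th percentile, for instance, hints at outliers worth plotting.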
PYTHON
NumPy
Arrays and multidimensional arrays; deals primarily with numerical data manipulation
Pandas
Library built on NumPy that provides data structures like DataFrame and Series, with indexing and slicing, MultiIndex, and handling of NaN values
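A minimal sketch of the pandas features named above (Series, DataFrame, label/position indexing and slicing, NaN handling, MultiIndex); all data here is made up:

```python
import numpy as np
import pandas as pd

# A Series and a DataFrame built on top of NumPy arrays.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("wxyz"))

# Label-based and position-based indexing/slicing.
row = df.loc[1, "x"]        # label-based lookup
block = df.iloc[0:2, 1:3]   # position-based 2x2 slice

# NaN handling.
s2 = pd.Series([1.0, np.nan, 3.0])
filled = s2.fillna(0.0)

# MultiIndex: hierarchical row labels.
mi = pd.Series([1, 2, 3, 4],
               index=pd.MultiIndex.from_product([["g1", "g2"], ["a", "b"]]))
```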
Scikit-Learn
Machine learning in python
Seaborn
Seaborn is a Python visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive statistical graphics, with themes, color palettes, and beautiful plots
Matplotlib
Python's data visualization and plotting library
SQL
Things such as the number of disk I/Os required to evaluate the plan, the plan's CPU cost, the overall response time observable by the database client, and the total execution time are essential.
If you have correlated subqueries that use EXISTS, try to select a constant in that subquery's SELECT statement instead of the value of an actual column. This is especially handy when you're only checking existence
When you use the LIKE operator in a query, a pattern that starts with % or _ prevents the database from using an index
When you use the OR operator in your query, it’s likely that you’re not using an index.
Order of table in joins
Redundant Conditions on Joins
The HAVING clause restricts the groups of returned rows (produced by GROUP BY) to only those that meet certain conditions. However, if you use this clause in your query, the index is not used
Apache Spark
Spark is a successor to Hadoop MapReduce and handles latency much better: its programming model keeps intermediate results in memory rather than redistributing intermediate data over the network on failure. Can be used from Python via pyspark
Java
Tableau
Pentaho DI (Spoon PDI)
Gradient descent
Technique to find the minimum of a cost function, such as the mean squared error function
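A minimal gradient-descent sketch minimizing mean squared error for a 1-D linear fit; the data is synthetic (y = 2x + 1 exactly), so the true minimum is known to be w = 2, b = 1:

```python
import numpy as np

# Synthetic data: y = 2x + 1, so the MSE minimum is at w=2, b=1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0
lr = 0.05  # learning rate (step size), a hyperparameter

for _ in range(2000):
    err = (w * x + b) - y
    # Partial derivatives of MSE = mean((w*x + b - y)^2) w.r.t. w and b.
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    # Step downhill, against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b
```

If `lr` is too large the iterates diverge instead of converging; too small and convergence is needlessly slow.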
Multivariate Analysis
Correlation matrix
T-tests
A t-test is a type of hypothesis test used when a claim is made about a population whose parameters are unknown; we estimate them from the sample and check whether the claim is statistically significant at a given confidence level, using the t distribution with the appropriate degrees of freedom (df)
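A one-sample t-test sketch with scipy; the sample is synthetic and the claimed population mean of 5 is an arbitrary example:

```python
import numpy as np
from scipy import stats

# Claim (null hypothesis): the population mean is 5.
# The population variance is unknown, so it is estimated from the
# sample -- hence the t distribution with n-1 degrees of freedom.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.0, size=30)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# At the 95% confidence level, reject the claim only if p < 0.05.
# (Here the claim happens to be true, so p will usually be large.)
reject = p_value < 0.05
```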
Chi Squared tests
The chi-squared test is a test of variation: we check whether the observed difference is significant enough to reject a hypothesis, using the chi-squared distribution with the appropriate degrees of freedom (df)
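A chi-squared test of independence with scipy on a made-up 2x2 contingency table (e.g. treatment vs. outcome counts):

```python
import numpy as np
from scipy import stats

# Observed counts (hypothetical numbers).
observed = np.array([[30, 10],
                     [20, 20]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# dof = (rows - 1) * (cols - 1) = 1 for a 2x2 table.
# Reject independence at the 95% level if p < 0.05.
```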
Clustering
Unsupervised learning, in which there is no labelled data; K-means clustering is a common method
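A K-means sketch with scikit-learn on two synthetic, well-separated blobs; note that no labels are passed to the fit:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points (synthetic, unlabelled).
rng = np.random.default_rng(42)
blob1 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob1, blob2])

# k=2 clusters; K-means sees only X, never any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Choosing k is itself a modeling decision; the elbow method on `km.inertia_` over several k values is one common heuristic.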
Segmentation
Cross Validation/Precision Recall/ROC
Computed from the classifier's score for each test output
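A sketch of cross-validation and score-based metrics with scikit-learn on synthetic data; logistic regression is just a convenient stand-in classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical binary classification data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Cross-validation: accuracy over 5 folds of the training set.
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)

# ROC AUC uses the classifier's score for each test example,
# not just the hard 0/1 prediction.
scores = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)

# Precision and recall use the hard predictions.
pred = clf.predict(X_te)
prec = precision_score(y_te, pred)
rec = recall_score(y_te, pred)
```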
Linear Regression
Logistic Regression
The dependent variable is binary, i.e. 0 or 1. Also, there should be no extreme outliers (Z scores outside -3.29 to 3.29), and no strong correlation among the independent variables
Methods:
Residual plots
Cross Validation
Precision Recall Curves (Only for discrete Dependent Variable)
Confusion Matrix (Only for discrete Dependent Variable)
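A small logistic-regression sketch with a confusion matrix, on a made-up one-feature dataset where the label is 1 when the feature is large:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Tiny made-up dataset; the dependent variable is binary (0 or 1).
X = np.array([[0.1], [0.4], [0.9], [1.5], [2.0], [2.6], [3.1], [3.8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

# Confusion matrix: rows = actual class, columns = predicted class.
cm = confusion_matrix(y, pred)
```

The diagonal of `cm` counts correct predictions; off-diagonal cells count false positives and false negatives.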
Decision Trees
Max_leaf_nodes: maximum number of leaf nodes in the tree
Max_depth: maximum depth, i.e. number of levels of splits
Min_samples_leaf: minimum number of samples each leaf must keep for a split to be allowed
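These hyperparameters map directly onto scikit-learn's DecisionTreeClassifier; iris is used here only as a convenient built-in dataset, and the values are arbitrary illustrations:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Small values to keep the tree deliberately shallow.
tree = DecisionTreeClassifier(
    max_leaf_nodes=4,     # at most 4 leaves in the whole tree
    max_depth=3,          # at most 3 levels of splits
    min_samples_leaf=5,   # each leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

n_leaves = tree.get_n_leaves()
depth = tree.get_depth()
```

Tightening these limits trades training accuracy for simpler, less overfit trees.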
SVM
SVC Classifiers
Maps data to -1 or +1 (similar to logistic regression, where it is 0 or 1)
Uses the sign of the hypothesis function's output
The maximum-margin linear classifier is the SVC (how far the classifier's margin can be widened before hitting a data point)
C determines regularization; a larger C means less regularization
Alpha is another hyperparameter that controls the weights assigned to the L1 and L2 penalties
Kernelized SVM is for data that is not linearly separable
Gamma comes into the picture here (kernel width parameter)
Higher gamma means more overfitting and narrower decision regions
The combination of C and gamma determines regularization
Not very useful for larger datasets (e.g. >50,000 samples)
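A kernelized SVC sketch on scikit-learn's make_circles data, which is not linearly separable; the C and gamma values are arbitrary illustrations:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes,
# so a kernelized SVC (RBF kernel) is needed.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# C controls regularization (larger C = less regularization);
# gamma is the RBF kernel width (larger gamma = tighter regions,
# more risk of overfitting).
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
acc = clf.score(X, y)  # training accuracy
```

In practice C and gamma are tuned jointly, e.g. with a grid search over a cross-validated range.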
Principal Component Analysis
Principal Component Analysis is used to remove the redundant features from the datasets without losing much information.
Method to transform an n-dimensional space into a lower-dimensional space by removing unwanted features. The eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data; those are the ones that can be dropped.
Thus:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X_std)  # X_std: the standardized feature matrix
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlim(0, 7)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
The above plot shows that the first 6 components explain almost 90% of the variance. Therefore, the components beyond the 6th can be dropped.
Ensemble Modelling/Random Forest
A/B testing
A/B testing is used everywhere. Marketing, retail, newsfeeds, online advertising, and more.
A/B testing is all about comparing things.
If you’re a data scientist, and you want to tell the rest of the company, “logo A is better than logo B”, well you can’t just say that without proving it using numbers and statistics.
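One way to back "logo A is better than logo B" with numbers is a chi-squared test on conversion counts; the figures below are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical results: 1000 visitors saw each logo.
#                  converted  not converted
counts = np.array([[100,      900],    # logo A
                   [160,      840]])   # logo B

chi2, p, dof, expected = chi2_contingency(counts)

# If p < 0.05, the difference in conversion rate is statistically
# significant at the 95% confidence level.
significant = p < 0.05
```

With these counts (10% vs. 16% conversion), the difference is large enough that the test rejects "no difference between the logos".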
Jupyter Notebook
Oracle 10g
Eclipse
WebLogic 9.x
Fetched data from various websites using locu.com in Python