As you might have guessed from the last notebook, using all of the variables allowed you to drastically overfit the training data. That looked great in terms of the R-squared on those points, but it was not great in terms of how well you were able to predict on the test data.
We will start where we left off in the last notebook. First read in the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import AllTogether as t
import seaborn as sns
%matplotlib inline
df = pd.read_csv('./survey_results_public.csv')
df.head()
Question 1
1. To begin, fill in the format function below with the correct variables. Notice each { } holds a space where one of your variables will be added to the string. Filling it in will walk you back through all of the steps you performed in this lesson.
a = 'test_score'
b = 'train_score'
c = 'linear model (lm_model)'
d = 'X_train and y_train'
e = 'X_test'
f = 'y_test'
g = 'train and test data sets'
h = 'overfitting'
q1_piat = '''In order to understand how well our {} fit the dataset,
we first needed to split our data into {}.
Then we were able to fit our {} on the {}.
We could then predict using our {} by providing
the linear model the {} for it to make predictions.
These predictions were for {}.
By looking at the {}, it looked like we were doing awesome because
it was 1! However, looking at the {} suggested our model was not
extending well. The purpose of this notebook will be to see how
well we can get our model to extend to new data.
This problem where our data fits the training data well, but does
not perform well on test data is commonly known as
{}.'''.format(a, a, a, a, a, a, a, a, a, a) #replace a with the correct variable
print(q1_piat)
# Print the solution order of the letters in the format
t.q1_piat_answer()
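If it has been a while since you used Python's str.format, here is a quick reminder of how each { } is filled by the arguments in order; the variables below are made up purely for illustration.
# .format fills each {} with its arguments, in order
first = 'split the data'
second = 'fit the model'
print('First we {}, then we {}.'.format(first, second))
# prints: First we split the data, then we fit the model.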
Question 2
2. Now we need to improve the model. Use the dictionary below to indicate which statements about improving this model are true. Consider each statement on its own; even if a step would only make sense after other steps, decide whether it would be a useful next step. A small sketch of one of these strategies follows the check below.
a = 'yes'
b = 'no'
q2_piat = {'add interactions, quadratics, cubics, and other higher order terms': #letter here,
'fit the model many times with different rows, then average the responses': #letter here,
'subset the features used for fitting the model each time': #letter here,
'this model is hopeless, we should start over': #letter here}
#Check your solution
t.q2_piat_check(q2_piat)
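To make the "fit the model many times with different rows, then average the responses" idea concrete, here is a tiny, self-contained sketch; the toy data below is made up purely for illustration and is not the survey data.
# Toy illustration of fitting on different bootstrapped rows and averaging the predictions
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.randn(200, 3), columns=['x1', 'x2', 'x3'])
y_toy = X_toy['x1'] * 2 + rng.randn(200)

preds = []
for seed in range(10):
    boot = X_toy.sample(frac=1.0, replace=True, random_state=seed)  # resample the rows
    lm = LinearRegression().fit(boot, y_toy.loc[boot.index])
    preds.append(lm.predict(X_toy))

avg_pred = np.mean(preds, axis=0)  # average the responses across the fits
print(r2_score(y_toy, avg_pred))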
Question 3
3. Before we get too far along, follow the steps in the function below to create the X (explanatory matrix) and y (response vector) to be used in the model. If your solution is correct, you should see a plot similar to the one shown in the Screencast.
def clean_data(df):
    '''
    INPUT
    df - pandas dataframe

    OUTPUT
    X - A matrix holding all of the variables you want to consider when predicting the response
    y - the corresponding response vector

    Perform the following steps to obtain the correct X and y objects.
    This function cleans df using the following steps to produce X and y:
    1. Drop all the rows with no salaries
    2. Create X as all the columns that are not the Salary column
    3. Create y as the Salary column
    4. Drop the Salary, Respondent, and the ExpectedSalary columns from X
    5. For each numeric variable in X, fill the column with the mean value of the column.
    6. Create dummy columns for all the categorical variables in X, drop the original columns
    '''

    return X, y
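If you get stuck, one possible way to fill in the function is sketched below. Treat it as an illustration rather than the official solution; in particular, how missing categorical values are handled (the dummy_na argument) is a judgment call.
def clean_data(df):
    # 1. Drop rows with missing salaries
    df = df.dropna(subset=['Salary'], axis=0)
    # 3. The response vector is the Salary column
    y = df['Salary']
    # 2./4. Predictors are everything except Salary, Respondent, and ExpectedSalary
    X = df.drop(['Salary', 'Respondent', 'ExpectedSalary'], axis=1)
    # 5. Fill each numeric column with its mean
    num_cols = X.select_dtypes(include=['float', 'int']).columns
    for col in num_cols:
        X[col] = X[col].fillna(X[col].mean())
    # 6. Create dummy columns for the categorical variables, dropping the originals
    cat_cols = X.select_dtypes(include=['object']).columns
    X = pd.get_dummies(X, columns=cat_cols, dummy_na=False)
    return X, y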
#Use the function to create X and y
X, y = clean_data(df)
Run the Cell Below to Achieve the Results Needed for Question 4
#cutoffs here pertain to the number of missing values allowed in the columns that are used.
#Therefore, lower values for the cutoff provide more predictors in the model.
cutoffs = [5000, 3500, 2500, 1000, 100, 50, 30, 25]
#Run this cell to pass your X and y to the model for testing
r2_scores_test, r2_scores_train, lm_model, X_train, X_test, y_train, y_test = t.find_optimal_lm_mod(X, y, cutoffs)
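If you are curious what a helper like find_optimal_lm_mod might be doing, a rough sketch is below. The exact column-selection rule, model settings, and return values in AllTogether.py may differ; this is only meant to illustrate the idea of trying progressively larger feature sets and tracking train and test R-squared.
def find_optimal_lm_mod_sketch(X, y, cutoffs, test_size=0.3, rand_state=42):
    '''Illustrative sketch only - not the actual helper from AllTogether.py.'''
    r2_scores_test, r2_scores_train = [], []
    for cutoff in cutoffs:
        # keep columns whose values sum above the cutoff (dummy columns used often enough)
        reduce_X = X.loc[:, X.sum() > cutoff]
        X_tr, X_te, y_tr, y_te = train_test_split(reduce_X, y, test_size=test_size, random_state=rand_state)
        lm = LinearRegression()
        lm.fit(X_tr, y_tr)
        r2_scores_train.append(r2_score(y_tr, lm.predict(X_tr)))
        r2_scores_test.append(r2_score(y_te, lm.predict(X_te)))
    # plot how the train/test R-squared change as the cutoff (and so the number of features) changes
    plt.plot(cutoffs, r2_scores_train, label='train R-squared')
    plt.plot(cutoffs, r2_scores_test, label='test R-squared')
    plt.xlabel('cutoff')
    plt.legend()
    return r2_scores_test, r2_scores_train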
Question 4
4. Use the output and the plot above to fill in the keys of the q4_piat dictionary with the correct variable. Notice that only the results for the optimal model are returned above; they are stored in lm_model, X_train, X_test, y_train, and y_test. If more than one answer holds, provide a tuple holding all of the correct variables, ordered alphabetically.
# Cell for your computations to answer the next question
a = 'we would likely have a better rsquared for the test data.'
b = 1000
c = 872
d = 0.69
e = 0.82
f = 0.88
g = 0.72
h = 'we would likely have a better rsquared for the training data.'
q4_piat = {'The optimal number of features based on the results is': #letter here,
'The model we should implement in practice has a train rsquared of': #letter here,
'The model we should implement in practice has a test rsquared of': #letter here,
'If we were to allow the number of features to continue to increase': #letter here
}
#Check against your solution
t.q4_piat_check(q4_piat)
Question 5
5. In sklearn, LinearRegression fits ordinary least squares and does not apply a penalty to the coefficients by default (an L2, or ridge, penalty comes from the separate Ridge estimator). Because all of the variables were normalized, we can still look at the size of the coefficients in the model as an indication of the impact of each variable on the salary: the larger the absolute coefficient, the larger the expected impact on salary.
Use the space below to take a look at the coefficients. Then use the results to answer the True or False statements based on the data.
Run the cells below, then complete the dictionary that follows.
def coef_weights(coefficients, X_train):
    '''
    INPUT:
    coefficients - the coefficients of the linear model
    X_train - the training data, so the column names can be used
    OUTPUT:
    coefs_df - a dataframe holding the variable name, the coefficient estimate, and abs(estimate)

    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df
#Use the function
coef_df = coef_weights(lm_model.coef_, X_train)
#A quick look at the top results
coef_df.head(20)
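One way to dig into the statements below is to peek at where the dummy columns created from a given original variable land in the sorted coefficients. The prefixes used here ('Country', 'Gender', 'YearsProgram') are assumptions about how the dummy columns were named in your X; adjust them to match your own columns.
# Hypothetical check: top coefficients whose variable name starts with a given prefix
for prefix in ['Country', 'Gender', 'YearsProgram']:
    mask = coef_df['est_int'].str.startswith(prefix)
    print(prefix)
    print(coef_df[mask].head(), '\n')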
a = True
b = False
#According to the data...
q5_piat = {'Country appears to be one of the top indicators for salary': #letter here,
'Gender appears to be one of the indicators for salary': #letter here,
'How long an individual has been programming appears to be one of the top indicators for salary': #letter here,
'The longer an individual has been programming the more they are likely to earn': #letter here}
t.q5_piat_check(q5_piat)
Congrats of some kind
Congrats! Hopefully this was a great review, or an eye-opening experience about how to put the steps of an analysis together. Take a moment to list the steps for yourself. In the next lesson, you will look at how to take this work and show it off to others so they can act on it.