Notes

Notes - notes.io

First Try of Predicting Salary

The last two questions concern how other variables relate to salary and job satisfaction. Each of these questions will involve not only building some sort of predictive model, but also finding and interpreting the influential components of whatever model we build.

To get started, let's import the necessary libraries and take a look at some of our columns of interest.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score, mean_squared_error

import WhatHappened as t

import seaborn as sns

%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')

df.head()

Now take a look at the summary statistics associated with the quantitative variables in your dataset.

df.describe()
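The describe() output is enough to answer the question below, but missing-value counts and spread can also be pulled out directly. The following is a minimal sketch on a small made-up DataFrame; the name toy and all of its values are hypothetical and are not taken from the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey; all values are made up.
toy = pd.DataFrame({
    'Respondent': [1, 2, 3, 4],
    'HoursPerWeek': [40.0, np.nan, 35.0, 50.0],
    'Salary': [50000.0, 82000.0, np.nan, np.nan],
})

# Number of missing values per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Standard deviation per column as one measure of spread (NaN values are skipped)
spread = toy.std()

print(missing_counts)
print(spread)
```

The same two calls can be run on df to cross-check what describe() reports.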

Question 1

1. Use the above to match each variable (a, b, c, d, e, or f) as the appropriate key that describes the value in the desc_sol dictionary.

a = 40

b = 'HoursPerWeek'

c = 'Salary'

d = 'Respondent'

e = 10

f = 'ExpectedSalary'

desc_sol = {'A column just listing an index for each row': #letter here,

'The maximum Satisfaction on the scales for the survey': #letter here,

'The column with the most missing values': #letter here,

'The variable with the highest spread of values': #letter here}

# Check your solution

t.describe_check(desc_sol)

A picture can often tell us more than numbers.

df.hist();

Often a useful plot is a correlation matrix - this can tell you which variables are related to one another.

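# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns;
# pandas 2.0 and later raise an error on text columns without it
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f");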
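To help read the heatmap, here is a small sketch of what the underlying correlation values mean. It uses a made-up DataFrame (toy, x, y_up, and y_down are hypothetical names) in which one column rises exactly with x and another falls exactly as x rises, so the Pearson correlations sit at the two extremes.

```python
import pandas as pd

# Hypothetical toy data chosen so the correlations are at the extremes
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y_up': [2, 4, 6, 8, 10],     # rises exactly with x
    'y_down': [10, 8, 6, 4, 2],   # falls exactly as x rises
})

# Pairwise Pearson correlation (the pandas default); values range from -1 to 1
corr = toy.corr()
print(corr.round(2))
```

Values near 0 in the heatmap indicate little linear relationship, while values near 1 or -1 indicate a strong positive or negative one.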

Question 2

2. Use the correlation matrix and histograms above to match each variable (a, b, c, d, e, f, or g) as the appropriate key that describes the value in the scatter_sol dictionary.

a = 0.65

b = -0.01

c = 'ExpectedSalary'

d = 'No'

e = 'Yes'

f = 'CareerSatisfaction'

g = -0.15

scatter_sol = {'The column with the strongest correlation with Salary': #letter here,

'The data suggests more hours worked relates to higher salary': #letter here,

'Data in the ______ column meant missing data in three other columns': #letter here,

'The strongest negative relationship had what correlation?': #letter here}

t.scatter_check(scatter_sol)

Here we move our quantitative variables to an X matrix, which we will use to predict our response, and we create the response vector y. We then split the data into training and testing sets. When we start the four-step process (instantiate, fit, predict, score), the fit step breaks.
Remember from the video: this code will break!

# Consider only numeric variables

X = df[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]

y = df['Salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)

#Four steps:

#Instantiate

# The normalize argument was removed in scikit-learn 1.2, so it is omitted here
lm_model = LinearRegression()

#Fit - why does this break?

lm_model.fit(X_train, y_train)

#Predict

#Score
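For reference, the four steps can be sketched end to end on data where they all succeed. This is only an illustration under stated assumptions: the arrays X_toy and y_toy are synthetic, generated here with no missing values, and have nothing to do with the survey columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic, complete data (hypothetical): y depends linearly on two features plus noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42)

model = LinearRegression()       # 1. Instantiate
model.fit(X_tr, y_tr)            # 2. Fit
y_pred = model.predict(X_te)     # 3. Predict
score = r2_score(y_te, y_pred)   # 4. Score on held-out data
print(score)
```

The score is computed on the test split only, which is what makes it an estimate of how the model handles data it has not seen.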

Question 3

3. Use the results above to match each variable (a, b, c, d, e, or f) as the appropriate key that describes the value in the lm_fit_sol dictionary.

a = 'it is a way to assure your model extends well to new data'

b = 'it assures the same train and test split will occur for different users'

c = 'there is no correct match of this question'

d = 'sklearn fit methods cannot accept NAN values'

e = 'it is just a convention people do that will likely go away soon'

f = 'python just breaks for no reason sometimes'

lm_fit_sol = {'What is the reason that the fit method broke?': #letter here,

'What does the random_state parameter do for the train_test_split function?': #letter here,

'What is the purpose of creating a train test split?': #letter here}

t.lm_fit_check(lm_fit_sol)
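The failure can be reproduced in isolation. The sketch below uses a made-up DataFrame (toy and its values are hypothetical) with missing entries, shows that fit raises a ValueError, and then drops the incomplete rows. Dropping rows is just one possible workaround shown for illustration, not necessarily the approach taken next in the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with missing values in both the feature and the response
toy = pd.DataFrame({
    'HoursPerWeek': [40.0, np.nan, 35.0, 50.0, 45.0],
    'Salary': [50000.0, 60000.0, np.nan, 70000.0, 65000.0],
})

# LinearRegression.fit rejects input that contains NaN
fit_failed = False
try:
    LinearRegression().fit(toy[['HoursPerWeek']], toy['Salary'])
except ValueError as err:
    fit_failed = True
    print('fit raised ValueError:', err)

# One possible workaround (an assumption, not the lesson's stated fix):
# keep only the rows where every column is present
complete = toy.dropna()
model = LinearRegression().fit(complete[['HoursPerWeek']], complete['Salary'])
print(len(complete), 'complete rows used for fitting')
```

Dropping rows shrinks the dataset, so the number of rows that remain is worth checking before trusting the fitted model.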
