Notes

For ML question:-
Step 1:- Check which file needs to be changed, i.e. the .ipynb file (most probably this one) or the .py file.

Step 2:- Change the file according to the code below:-

#Import Libraries and Setup AWS SDK:
import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role

# Initialize session and role
session = boto3.Session()
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# S3 bucket details
bucket_name = 'employee-data12345'

#Build S3 Path and Load Data:

# S3 path for the dataset ((CHANGE REQUIRED))
s3_path = f's3://{bucket_name}/inputfiles/employee_cleaned_data.csv'

# Load dataset
df = pd.read_csv(s3_path)

# Remove unique identifier column
df = df.drop(columns=['employee_id'])

# Extract numeric values from 'region' column
df['region'] = df['region'].str.extract(r'(\d+)', expand=False).astype(int)

# Display first few rows of the dataframe
df.head()
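
The region-extraction step depends on the pattern \d+ (one or more digits); a pattern written as d+ would look for the literal letter d instead. A minimal sketch with Python's built-in re module, assuming the region values look like 'region_7' (a hypothetical format, since the notes do not show the data):

```python
import re

# Hypothetical sample values; the real 'region' column may look different.
regions = ['region_7', 'region_22', 'region_13']

def region_number(value):
    """Return the first run of digits in value as an int."""
    match = re.search(r'(\d+)', value)
    if match is None:
        raise ValueError(f'no digits found in {value!r}')
    return int(match.group(1))

region_numbers = [region_number(r) for r in regions]
print(region_numbers)
```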

#Analyze and Visualize Data
import seaborn as sns
import matplotlib.pyplot as plt

# Check for duplicates ((CHANGE REQUIRED))
duplicate_count = df.duplicated().sum()
print(f'Number of duplicate records: {duplicate_count}')

# DataFrame shape after cleaning ((CHANGE REQUIRED))
print(f'DataFrame shape: {df.shape}')

# Pie chart for 'gender' column
gender_counts = df['gender'].value_counts()
gender_counts.plot.pie(autopct='%1.1f%%', figsize=(8, 8),
                       title='Gender Distribution')
plt.show()

# Count plot for 'education' column categorized by gender
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='education', hue='gender')
plt.title('Education Level Distribution by Gender')
plt.show()

#Feature Engineering
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# Define target variable
target_column = 'awards_won'

# Split data into X and y
X = df.drop(columns=[target_column])
y = df[target_column]

# Define the actual categorical and numerical columns
categorical_columns = ['department', 'region', 'education', 'gender',
                       'recruitment_channel']
numerical_columns = ['no_of_trainings', 'age', 'previous_year_rating',
                     'length_of_service', 'KPIs_met_more_than_80', 'avg_training_score']

# Define column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_columns),
        ('num', StandardScaler(), numerical_columns)
    ]
)

# Apply column transformer
X_transformed = column_transformer.fit_transform(X)

# Feature selection
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X_transformed, y)

# Verify selected columns ((CHANGE REQUIRED))
print(f'Selected features shape: {X_selected.shape}')
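
As a rough illustration of what the column transformer does, the sketch below one-hot encodes one categorical column and standardizes one numerical column in plain Python. The sample values are made up. StandardScaler divides by the population standard deviation, which statistics.pstdev reproduces here.

```python
from statistics import mean, pstdev

# Made-up sample values, for illustration only.
gender = ['m', 'f', 'm', 'f']
age = [30, 40, 50, 60]

def one_hot(values):
    """One 0/1 column per distinct category, categories in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

encoded = one_hot(gender)
scaled = standardize(age)
print(encoded)
print(scaled)
```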

#Creating the Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
import seaborn as sns
import matplotlib.pyplot as plt

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.2, random_state=0)
# Since X_selected is already scaled, no need to apply StandardScaler again

# Build and train the model
model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)

# Predict values
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics ((CHANGE REQUIRED))
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

# Confusion matrix heatmap
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
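
For a binary target, the four metrics above all follow from the confusion-matrix counts. A minimal sketch in plain Python with made-up labels (not results from this dataset), treating 1 as the positive class as scikit-learn does by default:

```python
# Made-up labels for illustration; 1 is the positive class.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy_demo = (tp + tn) / len(pairs)
precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```

If the target has more than two classes, precision_score, recall_score and f1_score need an explicit average argument (for example average='macro').
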
#Deploying a Machine Learning Model
import joblib
import tempfile
import boto3

# Serialize the model to a file
model_filename = 'model.pkl'

# Create a temporary file and save the model to it
with tempfile.TemporaryFile() as temp_file:
    joblib.dump(model, temp_file)
    temp_file.seek(0)

    # Upload the model file to S3 ((CHANGE REQUIRED))
    s3_client = boto3.client('s3', region_name='us-east-1')
    s3_client.upload_fileobj(temp_file, bucket_name, f'ml-output/{model_filename}')
print(f'Successfully pushed data to S3: {model_filename}')
#Prediction using the Deployed Model

# Download the model file from S3 and load it ((CHANGE REQUIRED))
# The key must match the key used for the upload
with tempfile.TemporaryFile() as temp_file:
    s3_client.download_fileobj(bucket_name, f'ml-output/{model_filename}', temp_file)
    temp_file.seek(0)
    loaded_model = joblib.load(temp_file)

# Use the loaded model for predictions
y_pred_new = loaded_model.predict(X_test)

# Evaluate the loaded model
accuracy_new = accuracy_score(y_test, y_pred_new)
precision_new = precision_score(y_test, y_pred_new)
recall_new = recall_score(y_test, y_pred_new)
f1_new = f1_score(y_test, y_pred_new)

# Print evaluation metrics ((CHANGE REQUIRED))
print(f'Accuracy: {accuracy_new}')
print(f'Precision: {precision_new}')
print(f'Recall: {recall_new}')
print(f'F1 Score: {f1_new}')

# Confusion matrix heatmap for the new predictions
conf_matrix_new = confusion_matrix(y_test, y_pred_new)
sns.heatmap(conf_matrix_new, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - New Predictions')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
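
The save and load steps both call seek(0) because writing leaves the file position at the end of the temporary file; without rewinding, the upload or the load would start reading at the end and find no data. A minimal local sketch of the same round trip, using the standard-library pickle module in place of joblib and no S3 at all (the dictionary is a hypothetical stand-in for the trained model):

```python
import pickle
import tempfile

# Hypothetical stand-in for the trained model object.
stand_in_model = {'coef': [0.5, -1.2], 'intercept': 0.1}

with tempfile.TemporaryFile() as temp_file:
    pickle.dump(stand_in_model, temp_file)
    position_after_write = temp_file.tell()  # position is at the end of the data
    temp_file.seek(0)                        # rewind before reading
    restored_model = pickle.load(temp_file)

print(position_after_write > 0, restored_model == stand_in_model)
```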

For analytics question (S3 --> Redshift --> Glue):-

Step 1:- First create a cluster in Redshift
Amazon Redshift ---> Provisioned clusters dashboard ---> Create cluster ---> create a cluster according to the question

Step 2:- Create a crawler in AWS Glue (for S3):-
AWS Glue ---> Crawlers ---> Add crawler ---> create a crawler according to the question (Note: create the crawler that uses S3 as its data store) ---> Run crawler

After the run, one table will be created (location:- Glue --> Databases --> the tables are listed at the bottom of the page)
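
The same crawler can also be created from code. The sketch below only builds the keyword arguments for the boto3 Glue client's create_crawler call; the crawler name, role ARN and database name are hypothetical placeholders, and the values from the question should be used instead.

```python
def s3_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler with an S3 data store."""
    return {
        'Name': name,
        'Role': role_arn,
        'DatabaseName': database,
        'Targets': {'S3Targets': [{'Path': s3_path}]},
    }

def create_and_run_crawler(request, region_name='us-east-1'):
    """Create the crawler and start it (needs AWS credentials and boto3)."""
    import boto3  # imported here so the builder above works without boto3
    glue = boto3.client('glue', region_name=region_name)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request['Name'])

# Hypothetical placeholder values, not taken from any real account.
request = s3_crawler_request(
    name='s3-crawler',
    role_arn='arn:aws:iam::123456789012:role/GlueRole',
    database='employee_db',
    s3_path='s3://employee-data12345/inputfiles/',
)
print(request)
```

Calling create_and_run_crawler(request) would then do the equivalent of Add crawler followed by Run crawler in the console.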

Step 3:- Table creation in Amazon Redshift:-
Amazon Redshift ---> Clusters ---> go to the cluster created in Step 1 ---> Query data ---> Query in query editor

Connect to database --> Create new connection --> Temporary credentials --> Cluster (choose the Redshift cluster) -->
Database name (always give dev) --> Database user (give the user created with the Redshift cluster) --> Connect --> Run

Now in Query 1 write:- create database databaseName (any name), then Run --> Change connection --> connect to the database (the one created in this step) --> Run

Now copy the create table code from the AWS documentation (path:- Amazon Redshift --> Loading data from Amazon S3 --> Create the sample tables) --> copy code

Now paste the code on the first line of Query 1 and edit it:- after create table give the table name (same table name as created in Glue); inside the parentheses, in place of p_partkey give the column name
as shown in the Glue table (like id), and in place of integer not null give the type shown in the Glue table (like bigint); do the same for all columns present in the table --> Run
(table will be created)
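
The column-by-column edit described above amounts to mapping each Glue column to a Redshift column definition. A minimal sketch that builds the statement as a string; the table name, column names and type mapping are hypothetical examples, and the real ones should be read from the Glue table:

```python
# Hypothetical Glue-to-Redshift type mapping; extend it as needed.
GLUE_TO_REDSHIFT = {
    'bigint': 'BIGINT',
    'int': 'INTEGER',
    'double': 'DOUBLE PRECISION',
    'string': 'VARCHAR(256)',
}

def create_table_sql(table_name, glue_columns):
    """Build a CREATE TABLE statement from (column name, Glue type) pairs."""
    definitions = [
        f'    {name} {GLUE_TO_REDSHIFT[glue_type]}'
        for name, glue_type in glue_columns
    ]
    return f'CREATE TABLE {table_name} (\n' + ',\n'.join(definitions) + '\n);'

# Hypothetical columns as they might appear in the Glue table.
columns = [('id', 'bigint'), ('name', 'string'), ('score', 'double')]
sql = create_table_sql('employee', columns)
print(sql)
```

The printed statement can be pasted into the query editor in place of the sample table definition.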

Step 4:- VPC endpoint creation in VPC:-
VPC ---> Endpoints ---> Create endpoint ---> create a VPC endpoint according to the question

Step 5:- Create a connection in AWS Glue
AWS Glue ---> Connections ---> Add connection ---> create a connection according to the question

Step 6:- Create a crawler in AWS Glue (for Redshift)
AWS Glue ---> Crawlers ---> Add crawler ---> create a crawler according to the question ---> Run (a table will be created)

Step 7:- Create an ETL job in AWS Glue:-

AWS Glue ---> ETL ---> Jobs ---> Add job ---> Name (etl) ---> leave everything else as it is ---> Next ---> choose a data source (the S3 one) ---> Next ---> choose a data target (the Redshift one) ---> Save job ---> Run job

Step 8:- QuickSight:-
QuickSight ---> Datasets ---> New dataset ---> create a dataset according to the question
