This is part two of a series covering the use of PySpark for a clustering project in Google Colab. The series will cover reading in data, setting up a Spark environment in Colab, using PySpark to clean data, building a model, and deploying that model. For the rest of the series please check out my posts or the links directly in the references below.
This article covers data cleaning and model building. The data used is carried over from the last article, and without further ado, the work begins below.
Part I of the series covered setting up an environment to run PySpark in Google Colab and the code to make an API call for the data used:
End to End PySpark Clustering: Part I, Using Colab for PySpark and Collecting Data (medium.com)
Data Cleaning

Part of the cleaning is done before the data is used at all. This covers things like replacing missing values and checking the data types and structures. Other steps typical of data cleaning or preprocessing will be taken care of in the modelling step through the use of a pipeline.
In PySpark there are two kinds of empty values that can cause an error. The first of these is a null value. These are values for which there is no data. A check can be done to find them by running a select() with a count() of the named columns when() the value isnull(), using the pyspark.sql.functions of the same names.
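A minimal sketch of that check is below, assuming the data collected in Part I is already loaded as a DataFrame named df (the variable name is an assumption):

```python
from pyspark.sql import functions as F

# Count the null values in every column of df (the DataFrame built in Part I;
# the name df is an assumption).
null_counts = df.select([
    F.count(F.when(F.isnull(F.col(c)), c)).alias(c) for c in df.columns
])
null_counts.show()
```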
The second kind of empty value comes from reading in numeric data. The data we have contains three numeric columns, and empty values here could be read in as NaN (Not a Number) values. Usually these would be dealt with by imputing values in their place or via deletion; common methods include row deletion and imputing with the average value for the column. The NaN values can be checked for using the isnan() function.
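A sketch of the NaN check, again assuming df and using hypothetical names for the three numeric feature columns:

```python
from pyspark.sql import functions as F

# Hypothetical names for the three numeric feature columns.
numeric_cols = ['escape_velocity', 'density', 'gravity']

# Count the NaN values in each numeric column.
nan_counts = df.select([
    F.count(F.when(F.isnan(F.col(c)), c)).alias(c) for c in numeric_cols
])
nan_counts.show()
```

If any NaNs did turn up, they could be filled with the column mean using pyspark.ml.feature.Imputer, or the affected rows dropped with df.na.drop().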
The final step in cleaning the data is detecting outliers. The problem with outliers in clustering models is their ability to skew even scaled data to form their own cluster of one. The dataset we are using contains a range of very small and very large bodies. The detection of outliers is done manually using visualisation, which finds Jupiter to be a severe outlier: its escape velocity, density and gravity are all orders of magnitude larger than those of almost every other body. For this reason, Jupiter is removed from the dataset to avoid it biasing the algorithm toward building a Jupiter-only cluster.
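Removing the outlier is a single filter; a minimal sketch, assuming the body's name lives in a column called englishName (the Solar System API's field name, used here as an assumption):

```python
from pyspark.sql import functions as F

# Drop Jupiter, the severe outlier identified above, before modelling.
# 'englishName' is assumed to be the column holding each body's name.
df_clean = df.filter(F.col('englishName') != 'Jupiter')
```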
The chart showing the outlier plots the data both before and after removal:
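A rough plotting sketch for that comparison, pulling the small dataset into pandas and using matplotlib (column names as assumed above):

```python
import matplotlib.pyplot as plt

# Convert both versions of the (small) dataset to pandas for plotting.
pdf_before = df.toPandas()
pdf_after = df_clean.toPandas()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pdf_before['density'], pdf_before['gravity'])
axes[0].set_title('Before outlier removal')
axes[1].scatter(pdf_after['density'], pdf_after['gravity'])
axes[1].set_title('After outlier removal')
for ax in axes:
    ax.set_xlabel('density')
    ax.set_ylabel('gravity')
plt.tight_layout()
plt.show()
```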
Model Building

The model building step involves generating a pipeline. A pipeline can be used in PySpark to handle the preprocessing and training steps in one code block and structure. The pipeline used here will first generate a feature vector (a vector combining our training features, in this case escape velocity, density and gravity) using a VectorAssembler(). The vector is generated because machine learning in PySpark requires the data to be vectorised. The second step is scaling the data with a StandardScaler(); this forces all our data into a similar feature space, so the bodies become more comparable to one another as the data are scaled around the mean values. Finally, the KMeans() model is declared and all three of these processes are combined in a Pipeline().
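A sketch of that pipeline is below, assuming the cleaned DataFrame df_clean and the hypothetical column names used earlier; k is set to 3 here but is tuned in the next step:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

# Hypothetical names for the three training features.
feature_cols = ['escape_velocity', 'density', 'gravity']

# 1. Combine the training features into a single vector column.
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')

# 2. Scale the feature vector so the features sit in a similar range, centred on the mean.
scaler = StandardScaler(inputCol='features', outputCol='scaled_features',
                        withMean=True, withStd=True)

# 3. Declare the KMeans model on the scaled features.
kmeans = KMeans(featuresCol='scaled_features', k=3, seed=42)

# Combine all three stages into a single pipeline, fit it and generate predictions.
pipeline = Pipeline(stages=[assembler, scaler, kmeans])
model = pipeline.fit(df_clean)
clustered = model.transform(df_clean)
```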
The challenge when using KMeans clustering is the unsupervised nature of the algorithm, which gives us little control over the actual modelling process. The algorithm clusters the data according to a k value that is set beforehand. To find the best k value, silhouette scoring can be performed. Silhouette scores are an indication of how dense and well-defined clusters are; the higher the silhouette score, the better. The code excerpt below shows how the clusters are generated and the scores produced for this project.
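A sketch of the scoring loop, reusing the assembler and scaler defined above and PySpark's ClusteringEvaluator for the silhouette metric (the range of candidate k values is an assumption):

```python
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette evaluator over the scaled features and the KMeans 'prediction' column.
evaluator = ClusteringEvaluator(featuresCol='scaled_features',
                                predictionCol='prediction',
                                metricName='silhouette')

# Fit a pipeline for each candidate k and print its silhouette score.
for k in range(2, 7):
    kmeans_k = KMeans(featuresCol='scaled_features', k=k, seed=42)
    preds = Pipeline(stages=[assembler, scaler, kmeans_k]).fit(df_clean).transform(df_clean)
    print(f'k={k}: silhouette = {evaluator.evaluate(preds):.3f}')
```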
The optimal k value for dense and well-defined clusters here is 2. However, there is a tradeoff: by clustering into only two clusters, the insight drawn from the model is significantly limited. There is a slight reduction in silhouette score as k is increased to 3, but three clusters should give more insight into the underlying data structure. For this reason the value of k=3 is chosen.
Evaluation

The model built can now be evaluated, and this is done through visualisation and the interpretations that can be drawn from it. Using the three dimensions the model is based upon, the predicted clusters and the status of the body (whether or not it is a planet), the following code snippet builds a 3D chart.
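One way to build such a chart is with plotly express; a sketch assuming the fitted predictions DataFrame clustered from above, plus the Solar System API's isPlanet and englishName fields (assumed names):

```python
import plotly.express as px

# Bring the predictions into pandas for plotting.
pdf = clustered.toPandas()

# Treat the cluster label as categorical so each cluster gets a distinct colour.
pdf['prediction'] = pdf['prediction'].astype(str)

# 3D scatter of the three model features, coloured by predicted cluster
# and marked by whether the body is a planet.
fig = px.scatter_3d(
    pdf,
    x='escape_velocity', y='density', z='gravity',
    color='prediction',
    symbol='isPlanet',
    hover_name='englishName',
)
fig.show()
```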
The chart shows how well-defined the clusters are, with the bodies being split into large planetary bodies (Saturn and Neptune, for example), smaller planetary bodies (including Earth and Mars) and the non-planets (like the Moon and the Mars satellite Phobos). This model could now be used in practice to generate an unsupervised classification of what kind of body something is once we know measurable features like density, escape velocity and gravity. If other data were used to cluster, a different set of clusters could be produced, perhaps to further classify those other bodies. This is an example of how future work could refine the model built here.
However, this refinement is something for another time. The third and final part of this tutorial will focus on bringing the model into production so that stakeholders can view and interact with the modelling process in a more hands-on fashion. Please do stick around for more!
Useful Links

GitHub repo for this series:
GitHub - josephlewisjgl/pyspark_clustering: The code for the PySpark Clustering series of posts on… (github.com)
Solar System API for data used: https://api.le-systeme-solaire.net/en/
Part I of the series:
End to End PySpark Clustering: Part I, Using Colab for PySpark and Collecting Data (medium.com)
References

PySpark KMeans docs: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html
Useful article on Silhouette Scoring: https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111
