Introduction:
In unraveling this intricate problem, we delve into a detailed examination of Sparkify's data to decipher underlying patterns and insights crucial for predicting customer churn. Sparkify, a digital music service akin to industry behemoths like Spotify and YouTube Music, offers a diverse user experience, spanning subscription-based models, ad-supported free tiers, and features like playlist curation and social connections.
Approach:
The crux of our strategy lies in a methodical exploration of the dataset. Initiating this endeavor involves a profound dive into the data's intricacies, framing pertinent inquiries as our guiding compass. The subsequent phase entails a comprehensive investigation to extract meaningful answers and discern latent patterns. Equipped with these insights, we embark on the model training phase, employing a spectrum of algorithms to pinpoint the most effective one in forecasting customer churn. This systematic journey ensures a holistic grasp of the challenge and empowers us to make judicious decisions in navigating the intricacies of churn prediction.
Some information about my solution:
Overview of my project
This project serves as an illustrative example of addressing a large-scale machine learning challenge.
The dataset at hand comprises numerous user events within an audio streaming service provider, akin to Spotify. The objective is to identify whether a particular user is likely to cancel the service, leveraging their interactions with the platform for this prediction.
Python libraries
- pandas
- matplotlib
- numpy
- datetime
- seaborn
- pyspark
- time
My workflow for the analysis and modeling
- Import packages
- Data gathering
- Data Preprocessing
- EDA
- Feature Engineering
- Modeling
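As a minimal sketch of the first two steps (importing packages and gathering data), the snippet below creates a Spark session and loads the event log with PySpark. The file name mini_sparkify_event_data.json is an assumption about how the subset is stored; substitute the path to your own copy.

```python
# Minimal sketch of the "Import packages" and "Data gathering" steps.
# The file name below is an assumption; point it at your own event log.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Sparkify-Churn")
         .getOrCreate())

# Load the raw event log into a Spark DataFrame
# (286,500 rows x 18 columns in the subset analyzed here).
df = spark.read.json("mini_sparkify_event_data.json")
df.printSchema()
print(df.count(), len(df.columns))
```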
Metric to evaluate
All models are evaluated with the F1 Score, the harmonic mean of precision and recall. It is generally more informative than plain accuracy for churn prediction, where the churned class is usually the minority.
Models to predict churn
- Logistic Regression
Logistic Regression is a statistical method commonly used for binary classification, where the outcome is a categorical variable with two possible classes. Despite its name, Logistic Regression is primarily used for classification rather than regression tasks.
- Random Forest
Random Forest is a powerful ensemble learning method used for both classification and regression tasks. It belongs to the family of tree-based algorithms and is known for its robustness and high predictive accuracy.
- Gradient Boosting
Gradient Boosting is another powerful ensemble learning technique, often used for both classification and regression tasks. It is a machine learning algorithm that builds a predictive model in a stage-wise fashion, combining the predictions of multiple weak learners to create a strong learner.
- Linear Support Vector Machine
Linear Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. SVM is particularly powerful in solving linearly separable classification problems, where it aims to find the optimal hyperplane that best separates different classes in the feature space.
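All four of these algorithms are available in Spark ML. The sketch below shows how they could be instantiated; the column names "features" and "churn" are assumptions about the engineered dataset and should be adjusted to match the output of the feature-engineering step.

```python
# Hedged sketch: instantiating the four candidate classifiers in Spark ML.
# "features" and "churn" are assumed column names for the engineered data.
from pyspark.ml.classification import (
    LogisticRegression,
    RandomForestClassifier,
    GBTClassifier,
    LinearSVC,
)

lr  = LogisticRegression(featuresCol="features", labelCol="churn")
rf  = RandomForestClassifier(featuresCol="features", labelCol="churn")
gbt = GBTClassifier(featuresCol="features", labelCol="churn")
svc = LinearSVC(featuresCol="features", labelCol="churn")

candidates = {"Logistic Regression": lr,
              "Random Forest": rf,
              "Gradient Boosting": gbt,
              "Linear SVM": svc}
```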
Some results from the analysis
The raw dataframe has 286,500 rows and 18 columns.
Null and empty value counts for each column in the dataframe:

Column | Null values | Empty values |
---|---|---|
artist | 58392 | 0 |
auth | 0 | 0 |
firstName | 8346 | 0 |
gender | 8346 | 0 |
itemInSession | 0 | 0 |
lastName | 8346 | 0 |
length | 58392 | 0 |
level | 0 | 0 |
location | 8346 | 0 |
method | 0 | 0 |
page | 0 | 0 |
registration | 8346 | 0 |
sessionId | 0 | 0 |
song | 58392 | 0 |
status | 0 | 0 |
ts | 0 | 0 |
userAgent | 8346 | 0 |
userId | 0 | 8346 |
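A small sketch of how these per-column counts can be computed, assuming `df` is the raw event DataFrame loaded earlier:

```python
# Count null and empty-string values for every column of the event log.
from pyspark.sql import functions as F

for col_name in df.columns:
    nulls = df.filter(F.col(col_name).isNull()).count()
    # The empty-string check is only meaningful for string columns
    # such as userId; numeric columns simply report 0 here.
    empties = df.filter(F.col(col_name) == "").count()
    print(f"{col_name}: {nulls} null values, {empties} empty values")
```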
Data Preprocessing:
The null values for artist, length, and song all occur in the same records. I opted not to drop these rows, since they still capture user behavior (page visits such as Home or Settings) even when the user is not actively listening to music.
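A sketch of this preprocessing decision is shown below. Dropping the rows with an empty userId is an assumption about the cleaning step (it is not stated explicitly above); the rows with null artist/length/song are kept.

```python
# Keep rows with null artist/length/song; drop rows that cannot be tied
# to a user (empty userId) - the latter is an assumed cleaning step.
from pyspark.sql import functions as F

df_clean = df.filter(F.col("userId") != "")

# The 8,346 rows with an empty userId (e.g. logged-out or guest events)
# are removed; the 58,392 rows with null song information remain.
print(df_clean.count())
```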
EDA
The dataframe contains 52 userIds in the churn group.
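A hedged sketch of how the churn group can be flagged is shown below. Defining churn by the "Cancellation Confirmation" page event is an assumption (it is the usual convention for the Sparkify dataset); it assumes `df_clean` from the preprocessing step above.

```python
# Flag users who reached the "Cancellation Confirmation" page as churned.
from pyspark.sql import functions as F

churned_ids = (df_clean
               .filter(F.col("page") == "Cancellation Confirmation")
               .select("userId")
               .distinct())
print(churned_ids.count())  # expected to be 52 on this subset

# Attach a binary churn label to every event via a left join.
df_labeled = (df_clean
              .join(churned_ids.withColumn("churn", F.lit(1)),
                    on="userId", how="left")
              .fillna(0, subset=["churn"]))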
The 'level' feature takes two values, 'free' and 'paid'; there are more users on the 'free' level than on the 'paid' level.
The 'gender' feature takes two values, 'F' (female) and 'M' (male); there are more male users than female users.
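A sketch of the aggregations behind these observations and the charts that follow, assuming `df_labeled` carries the churn flag from the previous step:

```python
# Distinct users per level and per gender, split by churn status.
from pyspark.sql import functions as F

level_counts = (df_labeled
                .groupBy("level", "churn")
                .agg(F.countDistinct("userId").alias("users"))
                .orderBy("level", "churn"))
level_counts.show()

gender_counts = (df_labeled
                 .groupBy("gender", "churn")
                 .agg(F.countDistinct("userId").alias("users"))
                 .orderBy("gender", "churn"))
gender_counts.show()
```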
Below are some charts illustrating the correlation between gender and level with respect to the user status:
Some results after running the models
Process:
- Read data from DataFrame and select important features.
- Divide the data into training, test, and validation sets.
- Train the Logistic Regression model and evaluate performance using the F1 Score.
- Train the Random Forest model and evaluate performance using the F1 Score.
- Train the Linear SVM model and evaluate performance using the F1 Score.
- Train the Gradient Boosting model and evaluate performance using the F1 Score.
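A minimal PySpark sketch of this split-train-evaluate loop is below, shown for Logistic Regression. The per-user feature table `user_features` and its columns (n_songs, n_thumbs_down) are hypothetical placeholders; the real project selects its own engineered features before this step.

```python
# Minimal sketch of the training and evaluation loop described above.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumed per-user feature table: (userId, n_songs, n_thumbs_down, churn).
assembler = VectorAssembler(
    inputCols=["n_songs", "n_thumbs_down"],  # hypothetical features
    outputCol="features",
)
data = assembler.transform(user_features).select("features", "churn")

# 70 / 15 / 15 split into training, test, and validation sets.
train, test, validation = data.randomSplit([0.7, 0.15, 0.15], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="churn")
model = lr.fit(train)

evaluator = MulticlassClassificationEvaluator(
    labelCol="churn", predictionCol="prediction", metricName="f1")
f1 = evaluator.evaluate(model.transform(test))
print(f"Logistic Regression F1 on the test set: {f1:.2f}")
```

The same split and evaluator can be reused for the Random Forest, Linear SVM, and Gradient Boosting models to produce the comparison table below.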
Result:
Model | F1 Score | Training time (seconds) |
---|---|---|
Logistic Regression | 0.73 | 700 |
Linear SVM | 0.67 | 3590 |
Random Forest | 0.71 | 348 |
Gradient Boosting | 0.69 | 1209 |
The best model for predicting customer churn is Logistic Regression, with an F1 Score of 0.73 and a training time of 700 seconds. Random Forest is also a strong candidate: it trains much faster (348 seconds) and its F1 Score is only slightly lower (0.71). Therefore, in the context of the Udacity Data Scientist Sparkify Capstone Project, Random Forest could serve as an alternative or supplementary model alongside Gradient Boosting and Logistic Regression. Its ability to capture complex relationships in the data and to expose feature importances makes it a valuable tool in the data scientist's toolkit.
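As a small illustration of that interpretability point, feature importances can be read directly from a fitted Random Forest model. This sketch reuses the assumed `train` split and hypothetical feature names from the pipeline sketch above.

```python
# Read feature importances from a fitted Random Forest model.
from pyspark.ml.classification import RandomForestClassifier

rf_model = RandomForestClassifier(
    featuresCol="features", labelCol="churn").fit(train)

feature_names = ["n_songs", "n_thumbs_down"]  # hypothetical features
for name, importance in zip(feature_names,
                            rf_model.featureImportances.toArray()):
    print(f"{name}: {importance:.3f}")
```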
Conclusion and next actions
Forecasting churn is a captivating yet challenging endeavor with the potential to enhance company operations. To gain deeper insights into the dataset, I delved into a Sparkify subset dataset, conducting a thorough analysis. Among various models, the Logistic Regression prediction model emerged as the most effective, boasting an impressive F1 score of 73%.
To further improve predictive performance, one option is to train the model on the complete dataset. It would also be worth experimenting with other ML models, such as the gradient-boosting libraries LightGBM and XGBoost.
Furthermore, if some of the existing features turn out to have little influence on the churn rate, new features will need to be identified and incorporated to refine the model.
References:
- https://spark.apache.org/docs/1.4.1/ml-features.html
- https://sparkbyexamples.com/machine-learning/confusion-matrix-in-machine-learning/
- https://spark.apache.org/docs/latest/ml-classification-regression.html#classification
- https://www.educative.io/answers/what-is-the-f1-score
- https://stackoverflow.com/questions/41032256/get-same-value-for-precision-recall-and-f-score-in-apache-spark-logistic-regres
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html
Thank you
My email is nguyennhan8521@gmail.com, and the GitHub repository for this project is linked as My repo.