Introduction:
In unraveling this intricate problem, we delve into a detailed examination of Sparkify's data to decipher underlying patterns and insights crucial for predicting customer churn. Sparkify, a digital music service akin to industry behemoths like Spotify and YouTube Music, offers a diverse user experience, spanning subscription-based models, ad-supported free tiers, and features like playlist curation and social connections.
Approach:
The crux of our strategy lies in a methodical exploration of the dataset. Initiating this endeavor involves a profound dive into the data's intricacies, framing pertinent inquiries as our guiding compass. The subsequent phase entails a comprehensive investigation to extract meaningful answers and discern latent patterns. Equipped with these insights, we embark on the model training phase, employing a spectrum of algorithms to pinpoint the most effective one in forecasting customer churn. This systematic journey ensures a holistic grasp of the challenge and empowers us to make judicious decisions in navigating the intricacies of churn prediction.
Some information about my solution:
Overview of my project
This project serves as an illustrative example of addressing a large-scale machine learning challenge.
The dataset at hand comprises numerous user events within an audio streaming service provider, akin to Spotify. The objective is to identify whether a particular user is likely to cancel the service, leveraging their interactions with the platform for this prediction.
Python libraries
- pandas
- matplotlib
- numpy
- datetime
- seaborn
- pyspark
- time
My workflow for the analysis and modeling
- Import packages
- Data gathering
- Data Preprocessing
- EDA
- Feature Engineering
- Modeling
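As a minimal sketch of the first two steps (importing packages and gathering data), the snippet below creates a Spark session and loads the event log with PySpark. The file name mini_sparkify_event_data.json is an assumption about how the subset is stored; substitute the path to your own copy.

```python
# Minimal sketch of the "Import packages" and "Data gathering" steps.
# The file name below is an assumption; point it at your own event log.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Sparkify-Churn")
         .getOrCreate())

# Load the raw event log into a Spark DataFrame
# (286,500 rows x 18 columns in the subset analyzed here).
df = spark.read.json("mini_sparkify_event_data.json")
df.printSchema()
print(df.count(), len(df.columns))
```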
Metric to evaluate
All models are evaluated with the F1 Score, the harmonic mean of precision and recall. It is generally more informative than plain accuracy for churn prediction, where the churned class is usually the minority.
Models to predict churn
- Logistic Regression
Logistic Regression is a statistical method commonly used for binary classification, where the outcome is a categorical variable with two possible classes. Despite its name, Logistic Regression is primarily used for classification rather than regression tasks.
- Random Forest
Random Forest is a powerful ensemble learning method used for both classification and regression tasks. It belongs to the family of tree-based algorithms and is known for its robustness and high predictive accuracy.
- Gradient Boosting
Gradient Boosting is another powerful ensemble learning technique, often used for both classification and regression tasks. It is a machine learning algorithm that builds a predictive model in a stage-wise fashion, combining the predictions of multiple weak learners to create a strong learner.
- Linear Support Vector Machine
Linear Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. SVM is particularly powerful in solving linearly separable classification problems, where it aims to find the optimal hyperplane that best separates different classes in the feature space.
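All four of these algorithms are available in Spark ML. The sketch below shows how they could be instantiated; the column names "features" and "churn" are assumptions about the engineered dataset and should be adjusted to match the output of the feature-engineering step.

```python
# Hedged sketch: instantiating the four candidate classifiers in Spark ML.
# "features" and "churn" are assumed column names for the engineered data.
from pyspark.ml.classification import (
    LogisticRegression,
    RandomForestClassifier,
    GBTClassifier,
    LinearSVC,
)

lr  = LogisticRegression(featuresCol="features", labelCol="churn")
rf  = RandomForestClassifier(featuresCol="features", labelCol="churn")
gbt = GBTClassifier(featuresCol="features", labelCol="churn")
svc = LinearSVC(featuresCol="features", labelCol="churn")

candidates = {"Logistic Regression": lr,
              "Random Forest": rf,
              "Gradient Boosting": gbt,
              "Linear SVM": svc}
```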
Some results from the analysis
The raw dataframe has 286,500 rows and 18 columns.
Null and empty value counts for each column in the dataframe:

Column | Null values | Empty values |
---|---|---|
artist | 58392 | 0 |
auth | 0 | 0 |
firstName | 8346 | 0 |
gender | 8346 | 0 |
itemInSession | 0 | 0 |
lastName | 8346 | 0 |
length | 58392 | 0 |
level | 0 | 0 |
location | 8346 | 0 |
method | 0 | 0 |
page | 0 | 0 |
registration | 8346 | 0 |
sessionId | 0 | 0 |
song | 58392 | 0 |
status | 0 | 0 |
ts | 0 | 0 |
userAgent | 8346 | 0 |
userId | 0 | 8346 |
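A small sketch of how these per-column counts can be computed, assuming `df` is the raw event DataFrame loaded earlier:

```python
# Count null and empty-string values for every column of the event log.
from pyspark.sql import functions as F

for col_name in df.columns:
    nulls = df.filter(F.col(col_name).isNull()).count()
    # The empty-string check is only meaningful for string columns
    # such as userId; numeric columns simply report 0 here.
    empties = df.filter(F.col(col_name) == "").count()
    print(f"{col_name}: {nulls} null values, {empties} empty values")
```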
Data Preprocessing:
The null values for artist, length, and song all occur in the same records. I opted not to drop these rows, since they still capture user behavior (page visits such as Home or Settings) even when the user is not actively listening to music.
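A sketch of this preprocessing decision is shown below. Dropping the rows with an empty userId is an assumption about the cleaning step (it is not stated explicitly above); the rows with null artist/length/song are kept.

```python
# Keep rows with null artist/length/song; drop rows that cannot be tied
# to a user (empty userId) - the latter is an assumed cleaning step.
from pyspark.sql import functions as F

df_clean = df.filter(F.col("userId") != "")

# The 8,346 rows with an empty userId (e.g. logged-out or guest events)
# are removed; the 58,392 rows with null song information remain.
print(df_clean.count())
```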
EDA
The dataframe contains 52 userIds in the churn group.
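A hedged sketch of how the churn group can be flagged is shown below. Defining churn by the "Cancellation Confirmation" page event is an assumption (it is the usual convention for the Sparkify dataset); it assumes `df_clean` from the preprocessing step above.

```python
# Flag users who reached the "Cancellation Confirmation" page as churned.
from pyspark.sql import functions as F

churned_ids = (df_clean
               .filter(F.col("page") == "Cancellation Confirmation")
               .select("userId")
               .distinct())
print(churned_ids.count())  # expected to be 52 on this subset

# Attach a binary churn label to every event via a left join.
df_labeled = (df_clean
              .join(churned_ids.withColumn("churn", F.lit(1)),
                    on="userId", how="left")
              .fillna(0, subset=["churn"]))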
The 'level' feature takes two values, 'free' and 'paid'; there are more users on the 'free' level than on the 'paid' level.
The 'gender' feature takes two values, 'F' (female) and 'M' (male); there are more male users than female users.
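A sketch of the aggregations behind these observations and the charts that follow, assuming `df_labeled` carries the churn flag from the previous step:

```python
# Distinct users per level and per gender, split by churn status.
from pyspark.sql import functions as F

level_counts = (df_labeled
                .groupBy("level", "churn")
                .agg(F.countDistinct("userId").alias("users"))
                .orderBy("level", "churn"))
level_counts.show()

gender_counts = (df_labeled
                 .groupBy("gender", "churn")
                 .agg(F.countDistinct("userId").alias("users"))
                 .orderBy("gender", "churn"))
gender_counts.show()
```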
Below are some charts illustrating the correlation between gender and level with respect to the user status:
Some results after running the models
Process:
- Read data from DataFrame and select important features.
- Divide the data into training, test, and validation sets.
- Train the Logistic Regression model and evaluate performance using the F1 Score.
- Train the Random Forest model and evaluate performance using the F1 Score.
- Train the Linear SVM model and evaluate performance using the F1 Score.
- Train the Gradient Boosting model and evaluate performance using the F1 Score.
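A minimal PySpark sketch of this split-train-evaluate loop is below, shown for Logistic Regression. The per-user feature table `user_features` and its columns (n_songs, n_thumbs_down) are hypothetical placeholders; the real project selects its own engineered features before this step.

```python
# Minimal sketch of the training and evaluation loop described above.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumed per-user feature table: (userId, n_songs, n_thumbs_down, churn).
assembler = VectorAssembler(
    inputCols=["n_songs", "n_thumbs_down"],  # hypothetical features
    outputCol="features",
)
data = assembler.transform(user_features).select("features", "churn")

# 70 / 15 / 15 split into training, test, and validation sets.
train, test, validation = data.randomSplit([0.7, 0.15, 0.15], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="churn")
model = lr.fit(train)

evaluator = MulticlassClassificationEvaluator(
    labelCol="churn", predictionCol="prediction", metricName="f1")
f1 = evaluator.evaluate(model.transform(test))
print(f"Logistic Regression F1 on the test set: {f1:.2f}")
```

The same split and evaluator can be reused for the Random Forest, Linear SVM, and Gradient Boosting models to produce the comparison table below.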
Result:
Model | F1 Score | Training time (seconds) |
---|---|---|
Logistic Regression | 0.73 | 700 |
Linear SVM | 0.67 | 3590 |
Random Forest | 0.71 | 348 |
Gradient Boosting | 0.69 | 1209 |
The best model for predicting customer churn is Logistic Regression, with an F1 Score of 0.73 and a training time of 700 seconds. Random Forest is also a strong candidate: it trains much faster (348 seconds) and its F1 Score is only slightly lower (0.71). Therefore, in the context of the Udacity Data Scientist Sparkify Capstone Project, Random Forest could serve as an alternative or supplementary model alongside Gradient Boosting and Logistic Regression. Its ability to capture complex relationships in the data and to expose feature importances makes it a valuable tool in the data scientist's toolkit.
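As a small illustration of that interpretability point, feature importances can be read directly from a fitted Random Forest model. This sketch reuses the assumed `train` split and hypothetical feature names from the pipeline sketch above.

```python
# Read feature importances from a fitted Random Forest model.
from pyspark.ml.classification import RandomForestClassifier

rf_model = RandomForestClassifier(
    featuresCol="features", labelCol="churn").fit(train)

feature_names = ["n_songs", "n_thumbs_down"]  # hypothetical features
for name, importance in zip(feature_names,
                            rf_model.featureImportances.toArray()):
    print(f"{name}: {importance:.3f}")
```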
Conclusion and next actions
Forecasting churn is a captivating yet challenging endeavor with the potential to enhance company operations. To gain deeper insights into the dataset, I delved into a Sparkify subset dataset, conducting a thorough analysis. Among various models, the Logistic Regression prediction model emerged as the most effective, boasting an impressive F1 score of 73%.
To further improve predictive performance, one option is to train the model on the complete dataset. It would also be worth experimenting with other ML models, such as the gradient-boosting libraries LightGBM and XGBoost.
Furthermore, if some of the existing features turn out to have little influence on the churn rate, new features will need to be identified and incorporated to refine the model.
References:
- https://spark.apache.org/docs/1.4.1/ml-features.html
- https://sparkbyexamples.com/machine-learning/confusion-matrix-in-machine-learning/
- https://spark.apache.org/docs/latest/ml-classification-regression.html#classification
- https://www.educative.io/answers/what-is-the-f1-score
- https://stackoverflow.com/questions/41032256/get-same-value-for-precision-recall-and-f-score-in-apache-spark-logistic-regres
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html
Thank you
My email is nguyennhan8521@gmail.com, and the GitHub repository for this project is linked as My repo.