
Churn Customer Prediction


Posted by: nguyen trong nhan

Via Viblo Asia

Introduction:

To approach this problem, we examine Sparkify's data in detail, looking for patterns that help predict customer churn. Sparkify is a digital music service similar to Spotify and YouTube Music. It offers subscription plans, an ad-supported free tier, and features such as playlist curation and social connections.

Approach:

Our strategy is a methodical exploration of the dataset. We start by studying the data and framing the questions that guide the analysis. We then investigate the data to answer those questions and uncover latent patterns. With these insights, we move on to model training, trying several algorithms to find the one that forecasts customer churn most effectively. This sequence gives a complete view of the problem and supports sound decisions about churn prediction.

Some information about my solution:

Overview of my project

This project serves as an illustrative example of addressing a large-scale machine learning challenge.

The dataset at hand comprises numerous user events within an audio streaming service provider, akin to Spotify. The objective is to identify whether a particular user is likely to cancel the service, leveraging their interactions with the platform for this prediction.

Python libraries

  • pandas
  • matplotlib
  • numpy
  • datetime
  • seaborn
  • pyspark
  • time

My workflow for analysis and modeling

  • Import packages
  • Data gathering
  • Data Preprocessing
  • EDA
  • Feature Engineering
  • Modeling

Metric to evaluate

Source: https://towardsdatascience.com/a-single-number-metric-for-evaluating-object-detection-models-c97f4a98616d
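The Result section below reports an F1 Score for each model. As a reminder, here is the standard binary definition, where TP, FP, and FN are the true positives, false positives, and false negatives. Treating churned users as the positive class is an assumption here, not something the article states.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```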

Model to predict

  • Logistic Regression

Logistic Regression is a statistical method commonly used for binary classification, where the outcome is a categorical variable with two possible classes. Despite its name, Logistic Regression is primarily used for classification rather than regression tasks.
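To make the "classification rather than regression" point concrete, here is a minimal sketch of how a logistic regression model turns a linear score into a class label. The weights, bias, and feature values are hypothetical and chosen only for illustration; they do not come from the trained Sparkify model.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_churn_probability(features, weights, bias):
    """Compute a linear score, then squash it with the sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical parameters and one made-up user with two features.
weights = [0.8, -1.2]
bias = -0.1
user = [2.0, 0.5]

p = predict_churn_probability(user, weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(f"churn probability = {p:.3f}, predicted label = {label}")
```

The model itself outputs a continuous probability. The classification step is the final threshold, and 0.5 is a common default rather than a required value.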

  • Random Forest

Random Forest is a powerful ensemble learning method used for both classification and regression tasks. It belongs to the family of tree-based algorithms and is known for its robustness and high predictive accuracy.

  • Gradient Boosting

Gradient Boosting is another powerful ensemble learning technique, often used for both classification and regression tasks. It is a machine learning algorithm that builds a predictive model in a stage-wise fashion, combining the predictions of multiple weak learners to create a strong learner.
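The stage-wise idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the project: it uses one feature, decision stumps as weak learners, and squared-error loss (so each stage fits the current residuals) instead of the log-loss a classifier would normally use. The function names and toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split stump: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Start from the mean, then add one stump per stage fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]

    def model(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return model


# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
predicted_labels = [1 if model(x) >= 0.5 else 0 for x in xs]
print(predicted_labels)
```

Each stump alone is a weak learner, but the sum of many small corrections approaches the target, which is the sense in which weak learners combine into a strong one.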

  • Linear Support Vector Machine

Linear Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. SVM is particularly powerful in solving linearly separable classification problems, where it aims to find the optimal hyperplane that best separates different classes in the feature space.

Some results from the analysis

The raw DataFrame has 286500 rows and 18 columns.

Number of null and empty values in each column of the DataFrame:

  • artist: 58392 null values, 0 empty values
  • auth: 0 null values, 0 empty values
  • firstName: 8346 null values, 0 empty values
  • gender: 8346 null values, 0 empty values
  • itemInSession: 0 null values, 0 empty values
  • lastName: 8346 null values, 0 empty values
  • length: 58392 null values, 0 empty values
  • level: 0 null values, 0 empty values
  • location: 8346 null values, 0 empty values
  • method: 0 null values, 0 empty values
  • page: 0 null values, 0 empty values
  • registration: 8346 null values, 0 empty values
  • sessionId: 0 null values, 0 empty values
  • song: 58392 null values, 0 empty values
  • status: 0 null values, 0 empty values
  • ts: 0 null values, 0 empty values
  • userAgent: 8346 null values, 0 empty values
  • userId: 0 null values, 8346 empty values
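The counts above distinguish null values from empty strings, which matters for userId. A minimal sketch of that check is shown here in plain Python on a few made-up event rows. The rows and the helper name `count_null_and_empty` are hypothetical. In PySpark, the same idea is typically expressed per column with `isNull()` and an equality test against the empty string.

```python
def count_null_and_empty(rows, columns):
    """Return {column: (null_count, empty_count)} for a list of dict records."""
    counts = {}
    for column in columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        empties = sum(1 for row in rows if row.get(column) == "")
        counts[column] = (nulls, empties)
    return counts


# Made-up rows that mimic the shape of the event log (not real data).
rows = [
    {"userId": "30", "artist": "Artist A", "song": "Song A",
     "length": 277.9, "page": "NextSong"},
    {"userId": "30", "artist": None, "song": None,
     "length": None, "page": "Home"},
    {"userId": "", "artist": None, "song": None,
     "length": None, "page": "Login"},
]

summary = count_null_and_empty(rows, ["userId", "artist", "song", "length", "page"])
for column, (nulls, empties) in summary.items():
    print(f"{column}: {nulls} null values, {empties} empty values")
```

Keeping the two counts separate avoids missing records like the last one, where userId is present as a value but carries no identifier.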

Data Preprocessing:

All of the null values for artist, length, and song occur in the same records. I chose not to drop these records because they can still reflect user behavior, even when the user is not actively listening to music.

EDA

The DataFrame contains 52 userIds in the churn group.

The 'level' feature has two values, 'free' and 'paid'. More users are on the 'free' level than on the 'paid' level.

The 'gender' feature has two values, 'F' (female) and 'M' (male). There are more male users than female users.

Below are some charts illustrating the correlation between gender and level with respect to the user status:

Some results after running the models

Process:

  • Read the data from the DataFrame and select the important features.
  • Split the data into a training set, a validation set, and a test set.
  • Train the Logistic Regression model and evaluate its performance using F1 Score.
  • Train the Random Forest model and evaluate its performance using F1 Score.
  • Train the Linear SVM model and evaluate its performance using F1 Score.
  • Train the Gradient Boosting model and evaluate its performance using F1 Score.
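The split and evaluation steps above can be sketched in plain Python as follows. The 70/15/15 split fractions, the seed, and the toy labels are assumptions for illustration; the article does not state the ratios it used. In PySpark, a split like this is usually done with `DataFrame.randomSplit`.

```python
import random


def split_dataset(rows, train_frac=0.7, validation_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class: 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical user ids standing in for per-user feature rows.
users = list(range(100))
train, validation, test = split_dataset(users)
print(len(train), len(validation), len(test))

# Toy labels (1 = churned) and predictions for a held-out set.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Fixing the seed keeps the three sets identical across runs, so F1 differences between models reflect the models and not a different split.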

Result:

Model | F1 Score | Training time (seconds)
Logistic Regression | 0.73 | 700
Linear SVM | 0.67 | 3590
Random Forest | 0.71 | 348
Gradient Boosting | 0.69 | 1209
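The trade-off between score and training time in this table can be checked with a short script. The figures are copied from the table, with model names written out as in the model list earlier in the article. The variable names are only illustrative.

```python
# F1 scores and training times as reported in the result table.
results = [
    {"model": "Logistic Regression", "f1": 0.73, "train_seconds": 700},
    {"model": "Linear SVM", "f1": 0.67, "train_seconds": 3590},
    {"model": "Random Forest", "f1": 0.71, "train_seconds": 348},
    {"model": "Gradient Boosting", "f1": 0.69, "train_seconds": 1209},
]

best_by_f1 = max(results, key=lambda r: r["f1"])
fastest = min(results, key=lambda r: r["train_seconds"])

# List models from best to worst F1, with the gap to the best score
# and the training time relative to the fastest model.
for r in sorted(results, key=lambda r: r["f1"], reverse=True):
    f1_gap = best_by_f1["f1"] - r["f1"]
    time_ratio = r["train_seconds"] / fastest["train_seconds"]
    print(f'{r["model"]}: F1 gap {f1_gap:.2f}, {time_ratio:.1f}x the fastest training time')
```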

The best model for predicting customer churn is Logistic Regression, with an F1 Score of 0.73 and a training time of 700 seconds. However, Random Forest is also a strong candidate: it trains in 348 seconds, about half the time, and its F1 Score of 0.71 is only slightly lower. Therefore, in the context of the Udacity Data Scientist Sparkify Capstone Project, Random Forest is worth considering alongside Logistic Regression and Gradient Boosting for predicting customer churn. Its ability to handle complex relationships in the data and to report feature importance makes it a useful tool for this task.

Conclusion and next action

Forecasting churn is an interesting yet challenging task with the potential to improve company operations. To understand the data better, I analyzed a subset of the Sparkify dataset. Among the models tested, Logistic Regression was the most effective, with an F1 score of 0.73.

One way to improve predictive performance is to train on the complete dataset. I also need to experiment with more ML models, such as LightGBM and XGBoost.

Furthermore, since the existing features appear to have little influence on the churn rate, new features need to be identified and added to refine the model.


Thank you

My email is nguyennhan8521@gmail.com, and the code for this project is in my GitHub repo.
