Predicting churn with user log data of music streaming service

Sparkify

Prateek Goel
Sep 28, 2021

This project is intended to find whether a person is likely to churn or not using the log data provided. This blog post will walk you through the approach I took while building this project.

Source: Udacity

Introduction

The data was provided by Udacity as part of the Data Scientist Nanodegree (DSND) and simulates music streaming logs from a fictional company, Sparkify, much like Spotify or Pandora. The data came in two forms: a medium-sized set (~500 thousand rows), which I used for EDA on my local machine, and a larger one (~25 million rows), which was used to build models on AWS EMR using PySpark.

Since AWS EMR is not free, I had to cut down on some operations and skip running them on the EMR cluster.

Reading Dataset

A glance at the data reveals that it has the following columns:

Also, the “page” column gives an idea of what action a user performed.

As a first step, I checked whether any null values were present.

Exploratory Data Analysis (EDA)

I first checked the gender-based distribution, which did not give me a clear picture, as there was no strong skew towards either gender.

In simple terms, how long a person has been using the application could play a vital role in predicting churn.

Based on Time Since Registration

This alone didn’t give a clear picture. So, other factors I considered important were the number of artists a person listens to and the number of thumbs-ups a person gives. The following plots show a clear trend.

In the end, to make the model perform even better, I added an “Add to Playlist” count feature as well. The intuition was that the more songs a person has added to playlists, the lower the chances of churn.

We also need a label (target) feature that tells us the churn status. The label was a binary feature: 1 (churn) if the user submitted a downgrade from paid to free, and 0 (no churn) otherwise.

Data Preparation

Before modelling, the data needs to be prepared:

  • Collect all the features into a single feature vector.
  • Scale the feature vector.
  • Split the data into train (75%) and test (25%) sets.

Data Modelling

I decided to implement the following models:

  1. Logistic Regression
  • Accuracy: 0.7687
  • F1 score: 0.6839

  2. Random Forest Classifier
  • Accuracy: 0.7809
  • F1 score: 0.7449

  3. Gradient Boosted Trees
  • Accuracy: 0.7730
  • F1 score: 0.7518

There’s only a minor difference between the gradient-boosted tree classifier and the random forest classifier. Considering the F1 score, we can say that the tree-based algorithms outperform logistic regression.

To finalise the model, I decided to do hyperparameter tuning using K-fold cross-validation.

Conclusion

Unfortunately, I ran into two issues while performing this hyperparameter tuning:

  • The process took a lot of time, sometimes even ending in a session timeout; cross-validating GBT is a very expensive process.
  • The F1 score did not improve much.

Improvements

  • The model was underfitting after cross-validation, so it was better to go with the unoptimised one.
  • A larger dataset could lead to a better-fitting, more optimised model.
  • More features could be considered.
  • More classification algorithms could be tested, which I couldn’t do due to limited resources.
  • More values could be added to the parameter grid, but this could make execution take a very long time.

To follow the in-depth approach of this analysis, see my GitHub repo here.

