Data Preparation - Churn datasets

First published on February 28, 2022

Last updated on April 19, 2022


13 minute read

Rupesh C.


Increasing customer retention is one of the biggest challenges subscription-based businesses and retailers are facing. While there are several things you can do to improve retention rates, predicting customers at risk of leaving and changing their minds is one of the most cost-efficient ways to do so.


  • Introduction

  • Why do Customers Churn?

  • Churn Prediction 

  • Churn Rate

  • Different ways of Churn Prediction

  • ML-based Churn Prediction

  • Churn Prediction using Mage

  • Conclusion


Introduction

Subscription businesses sell a product or service and collect recurring revenue for continuing to provide it. Most charge either monthly or yearly. Over the past few years, industries everywhere have started to adopt subscription business models, mainly because customers can make small, regular payments instead of paying a high upfront cost.

To increase profit, companies need a continuous stream of recurring revenue from each customer. When customers stop paying their subscription charges, the company loses revenue. To counter this, subscription-oriented businesses often use churn prediction to analyze customer behavior and predict which customers will soon stop using the service. The approach is based on machine learning methods and is becoming increasingly important, as acquiring a new customer often costs more than retaining an existing one.

Individualized customer retention is difficult because subscription-based businesses usually have a lot of customers and cannot afford to spend much time on each one. The costs would be too high and would outweigh the extra revenue. However, if you could predict in advance which customers are at risk of leaving, you could cut retention costs by directing company resources solely toward those customers.

Churn Prevention (Source: Atrium)

Why do Customers Churn?

Have you ever wondered why your customers stop buying your products or stop using your services? This is called customer churn. Some of the major reasons for customer churn are:

  • Bad customer service

  • Price

  • Lack of communication

  • Lack of innovation

  • Treating all customers the same, without a genuine understanding of them

Part of the reason is the huge number of alternatives on the market. Choose any product or service and we bet you can find at least several competitors. In some businesses, like FMCG or fashion, competitors number in the hundreds. No wonder customers purchase products and services in many different places. This is how the modern world works, but it doesn't change the simple fact that no company wants to lose customers, which is why you need to take advantage of customer churn prediction.

Reasons for customer churn (Source: Kron4)

Churn Prediction


Churn prediction is the process of identifying, based on their past behavior, customers who are likely to cancel their subscriptions. In other words, for each customer, we want to know: "Will this customer leave within the next month?"

Examples of customer churn

  • Cancellation of a subscription.

  • Closure of an account.

  • Non-renewal of a contract or service agreement.

  • Switching to another service provider.

Churn Rate


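The churn rate is one of the critical performance indicators for subscription businesses. It measures the share of customers who stop using the service during a given period. Because this number matters so much, ML-based churn prediction is very popular among modern service providers.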

Churn Rate (Source: Tractionwise)
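As a rough illustration of how this metric is usually computed, here is a minimal Python sketch. It assumes the common definition (customers lost during a period divided by customers at the start of that period). The function name and the numbers are made up for the example and do not come from any of the datasets discussed here.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return customers_lost / customers_at_start


# Hypothetical month: 1,000 subscribers on the 1st, 50 of them cancel.
monthly = churn_rate(1000, 50)
print(f"Monthly churn rate: {monthly:.1%}")  # Monthly churn rate: 5.0%
```

Some teams divide by the average number of customers over the period instead. Whichever variant you pick, use it consistently so that rates stay comparable from month to month.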

Industries affected by Churn

  • Telecom Companies (cable or wireless): 

These companies may provide a full range of products and services, including wireless network, internet, TV, cell phone, and home phone services (AT&T, Sprint, Verizon, T-Mobile, etc.). There are many competitors in these industries, so they experience churn quite often. They hold a large amount of historical customer data, which they use to detect churn. This data contains customers' personal information. Sharing it directly would be a breach of data privacy, so companies use techniques such as data encryption to hide sensitive information from others. For research purposes, IBM has provided a dataset of a fictional telco company, which provides home phone and internet services.

  • E-commerce Companies: 

These companies buy and sell goods and services over the Internet, through computers, tablets, smartphones, and other smart devices. Almost anything can be purchased through e-commerce today. For e-commerce businesses, customer churn means that customers stop buying from their stores. Amazon and eBay are some famous companies in this industry. We have a public dataset that belongs to an online e-commerce company. It includes 20 features describing customer information, e.g. gender, number of orders placed, number of coupons used, and any complaints raised by the customer. With this kind of data, retail (e-commerce) companies want to know which customers are going to churn so they can approach them with promotional offers.

Churn impact (Source: Thinkapps)

  • Insurance Companies:

Insurance companies around the world operate in a competitive environment, as there are many insurance firms on the market. Each company looks for the best way of selling its insurance products and targets a particular group of individuals. With data collected from millions of customers, it is painstakingly hard to analyze and understand why a customer decides to switch to a different insurance provider. Customer acquisition and retention are both important to growing a business in such a competitive sector. From experience, insurers have found that customer acquisition is more expensive than retention. Hence, insurance companies focus on customer retention, relying on data to understand customer behavior and prevent churn. Knowing beforehand when a customer is going to switch allows insurance companies to come up with strategies to prevent it from actually happening. The public dataset contains around 16 anonymized features related to customer churn.

Different ways of Churn Prediction

There are 2 types of methods used for churn prediction:

  1. Rule-Based Churn Prediction:

    These are older techniques, in which hand-written sets of rules are used to flag customer behavior. They depend heavily on the knowledge and experience of the person who sets the rules.

  2. Churn Prediction using Machine Learning (ML) techniques:

    Nowadays service providers have large amounts of data available. Data is the main strength of machine learning models: in general, the more relevant data there is, the more accurate the model tends to be. Hence machine learning methods are now the most common choice for churn prediction.

Modern solution (Source: Meming)
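To make the rule-based approach concrete, here is a minimal Python sketch. The field names, thresholds, and the "at least two rules" cut-off are all hypothetical, chosen only for illustration. In practice a domain expert would set them, which is exactly the dependence on personal experience described above.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    days_since_last_login: int
    support_tickets_last_30d: int
    on_monthly_plan: bool


def rule_based_churn_risk(c: Customer) -> bool:
    """Flag a customer as at risk when at least two hand-written rules fire."""
    rules_fired = [
        c.days_since_last_login > 30,     # has gone quiet
        c.support_tickets_last_30d >= 3,  # repeated problems
        c.on_monthly_plan,                # no long-term commitment
    ]
    return sum(rules_fired) >= 2


customers = [
    Customer("a1", days_since_last_login=45, support_tickets_last_30d=4, on_monthly_plan=True),
    Customer("b2", days_since_last_login=2, support_tickets_last_30d=0, on_monthly_plan=True),
]
at_risk = [c.customer_id for c in customers if rule_based_churn_risk(c)]
print(at_risk)  # ['a1']
```

An ML model replaces the hand-picked thresholds with ones learned from historical data, which is why it needs labeled examples of customers who did and did not churn.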

ML-based Churn Prediction

ML approach (Source: Towards Data Science)

For a successful implementation of an ML-based churn prediction approach, the following steps are important:

A. Understanding a problem and final goal

Retaining your existing customers with the company for as long as possible is very challenging. To keep your customers satisfied, you have to provide what they need along with good customer service to maintain their subscription.

Define Objective (Source: Tenor) 

We need to decide which question we want the model to answer about each customer. For example: Is this customer going to churn? When will the customer churn?

B. Data Gathering

Once we finalize the problem statement, we need data for training the machine learning model. The most common data sources for predicting customer churn are:

  • Customer Relationship Management systems (including sales and customer support records) 

  • Analytics services (e.g., Google Analytics) 

  • Reviews on social media 

To get better performance from an ML model, the data needs to contain as much relevant information about customers as possible. Each piece of customer information is called a feature. The more relevant customer features you gather, the more accurate your model is likely to be. For customer churn prediction, there are five main areas to focus on:

  • Define churn window

    – Customer churn prediction is about finding customers who may leave the company within a certain time. When collecting data, you first need to fix that time window, i.e. will the customer churn within one month, six months, or one year? Then you collect data for each customer from the period before the window starts. It's extremely important to understand that each customer in the data is represented as a “snapshot” taken before the window, which is then labeled with what we know now: whether the customer churned during the window or not. Because of this, when calculating feature values, we must be very careful not to use any information about the customer that only became available during the time window (including their usage of the service), since the model will not have that information at prediction time.

    Churn window (Source: Alteryx)

  • Customer Features

    – This data relates to the individual characteristics of each customer. It can include anything from age and gender to education level and income. These features help group together people with similar characteristics.

  • Support Features

    – Information related to how customers interact with your customer support. This data can include how often they contacted your support staff or the subjects of their queries. These features help identify customers who are running into problems.

Support features (Source: 123rf)

  • Usage Features

    – Any data you can gather on how each customer has used your service. For instance, how frequently they log in or when they last logged in. If you can also collect information on how long they spend in your app and what actions they take there, all the better. These features help find patterns in how customers use the service. 

  • Contextual Features

    – This includes all other data you can gather. For example, the device they use to access the company's service, or the customer support agent they've contacted the most.
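The snapshot idea behind the churn window can be sketched in a few lines of Python. Everything here is hypothetical: the snapshot date, the 30-day window, the record layout, and the feature names are assumptions made for the example, not a prescribed schema. The point is only that features are computed from events before the snapshot, while the label looks at what happened inside the window.

```python
from datetime import date, timedelta

SNAPSHOT = date(2022, 1, 1)        # features may only use data before this date
CHURN_WINDOW = timedelta(days=30)  # the label covers the 30 days from the snapshot

# Hypothetical raw records: login dates and an optional cancellation date.
customers = {
    "a1": {"logins": [date(2021, 12, 5), date(2021, 12, 28), date(2022, 1, 10)],
           "cancelled_on": date(2022, 1, 20)},
    "b2": {"logins": [date(2021, 12, 30), date(2022, 1, 15)],
           "cancelled_on": None},
    "c3": {"logins": [date(2021, 11, 2)],
           "cancelled_on": date(2022, 3, 1)},
}


def build_row(record: dict) -> dict:
    # Usage features: computed only from events strictly before the snapshot,
    # so logins that happen inside the churn window are ignored.
    past_logins = [d for d in record["logins"] if d < SNAPSHOT]
    days_since_last_login = (SNAPSHOT - max(past_logins)).days if past_logins else None

    # Label: did the customer cancel inside the churn window?
    cancelled = record["cancelled_on"]
    churned = cancelled is not None and SNAPSHOT <= cancelled < SNAPSHOT + CHURN_WINDOW

    return {
        "logins_before_snapshot": len(past_logins),
        "days_since_last_login": days_since_last_login,
        "churned": churned,
    }


rows = {cid: build_row(rec) for cid, rec in customers.items()}
for cid, row in rows.items():
    print(cid, row)
```

Customer "c3" cancels after the window closes, so the label is "not churned" for this window. A real pipeline would also drop customers who had already cancelled before the snapshot date, which this sketch does not handle.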

If you want to build a churn prediction model and you don’t have time to follow the above steps to generate data, then there are some public datasets available, which can be used directly to train the ML model with some preprocessing.


The telco dataset contains information about a fictional telco company that provided home phone and Internet services. It has historical information on 7043 customers for a time window of one month, along with a corresponding label (churn or not churn). Out of the 7043 customers, 1869 churned. The dataset has a total of 20 columns from which we have to predict customer churn. It is clean and doesn't have any NA or Null values.


The insurance dataset contains two CSV files (train.csv and test.csv). The train CSV file, which contains data for 33908 customers, can be used for training the ML model, and the test CSV file, which contains data for 11303 customers, should be used for validating the trained model. The train CSV file includes 3968 customers who churned. The dataset contains 16 anonymized features to maintain data privacy. It doesn't have any Null or NA values.


The e-commerce dataset contains historical data for 5630 customers, with 20 feature values for each customer. Out of the 5630 customers, 948 churned. Unlike the above two datasets, this dataset contains Null values in some columns, which can be filled using various imputation techniques. 
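One simple way to fill such gaps is median imputation, sketched below in plain Python. The rows and column names are hypothetical stand-ins rather than the real columns of the e-commerce dataset, and the median is just one of several reasonable choices (mean, mode, or a model-based fill are others).

```python
from statistics import median

# Hypothetical rows; None marks a missing value.
rows = [
    {"tenure_months": 4, "orders": 1},
    {"tenure_months": None, "orders": 2},
    {"tenure_months": 10, "orders": None},
    {"tenure_months": 7, "orders": 6},
]


def fill_missing_with_median(rows: list[dict], column: str) -> float:
    """Replace None in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = median(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill_value
    return fill_value


for column in ("tenure_months", "orders"):
    fill_missing_with_median(rows, column)

print(rows[1]["tenure_months"], rows[2]["orders"])  # 7 2
```

When the data is split into training and test sets, the fill value should be computed on the training rows only and then reused for the test rows, so that no information from the test set leaks into training.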

Of the above three datasets, we can go with the Insurance churn dataset, as it has the most examples and fewer features than the others, which should help the ML model reach good accuracy.
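Before settling on a dataset, it can help to check how balanced the labels are. The short sketch below uses only the customer and churn counts quoted in this article. The baseline it prints is the accuracy of a trivial model that always predicts "no churn", included here as a sanity check rather than as part of any particular method.

```python
# (total customers, churned customers) as reported in this article.
datasets = {
    "telco": (7043, 1869),
    "insurance (train.csv)": (33908, 3968),
    "e-commerce": (5630, 948),
}

shares = {}
for name, (total, churned) in datasets.items():
    churn_share = churned / total
    shares[name] = churn_share
    # Accuracy of a trivial model that always predicts "no churn".
    baseline_accuracy = 1 - churn_share
    print(f"{name}: churn share {churn_share:.1%}, "
          f"always-'no churn' accuracy {baseline_accuracy:.1%}")
```

The churn share comes out at roughly 26.5% for telco, 11.7% for insurance, and 16.8% for e-commerce. Since a model that never predicts churn would already score about 88% accuracy on the insurance training data, metrics such as precision and recall on the churn class are worth tracking alongside accuracy.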

Churn Prediction using Mage

You can use the ‘Churn Prediction’ use case from Mage. You just need to go with the following sequence:

  1. Sign up or log in with your credentials, then follow the process.

  2. Select the ‘Build new’ option from the left sidebar to create a new model.

    Build Model (Source: Mage)

3. Next, select the ‘Churn Prediction’ prompt.

Churn Prediction use case (Source: Mage)

4. After adding the dataset, select the churn column: choose from the list the feature that indicates whether a customer churned.

Churn column selection (Source: Mage)

5. Next step is model training.

Churn column selection (Source: Mage)

6. The Mage algorithm works as follows.

Just upload the data, select the churn column, and you're done. All other work is done by Mage internally.

Magical Solution (Source: Tenor)


Conclusion

Churn rate is a health indicator for subscription-based companies. The ability to identify unhappy customers in advance allows businesses to learn about weak points in their product or pricing plans, operational issues, and customer preferences and expectations, and so proactively reduce the reasons for churn.

Running an organization that relies mostly on a subscription-based model is not easy. Such companies must keep retaining their customers in order to grow. Therefore we can't neglect churn prediction: it helps the company grow and also helps identify what customers need. 

In this blog, we looked at some publicly available churn datasets from different industries, the proper way to collect data, and which data to use for training a machine learning model. Exploratory analysis can give us insights from a dataset, which help us remove unwanted features and do some preprocessing to improve data quality before feeding the data to the machine learning model. 

By using the Mage product, you can get a trained machine learning model by just passing data to the Mage product and selecting the use case.
