Guide to Churn Prediction: Part 1 - Gather & Clean

December 16, 2021 · 12 minute read

Jahnavi C.


In this blog we'll implement the first 3 steps on a customer dataset for Churn Prediction.

By now you should be familiar with all the steps required to build a machine learning model. If you didn't get a chance to go through them, feel free to check out the earlier post in this series, where I explained in detail all the steps required to build and optimize an ML model.



  • Intro

  • Defining the Objective

  • Data Gathering

  • Data Cleaning


A telecom company recently noticed that many of their customers churned unexpectedly. If this continues, the company will incur huge losses. So they immediately worked on a plan to retain their customers and considered two solutions:

  1. To reduce prices and offer special discounts to all their customers in order to retain the ones who might churn. The problem with this solution is that giving huge discounts to all customers may cause losses, or the company may make very little profit.

  2. The company has a huge amount of past data about its customers, including the customers who left, so they thought that analyzing this data may help them predict which customers are likely to churn and whether a special offer can retain them.

In the hope of reducing additional losses, the company decided to go with the second solution and approached a data scientist to help them analyze the data and predict the customers likely to churn.

So without any further delay, let’s dive into the world of data scientists and see how they approach this problem to predict customer churn.

Step 1: Define the Objective

It’s a telecommunications company that provides home phone services to residents in the USA.

The company noticed that their customers have been churning for a while. This has impacted their customer base and business revenue, so they need a plan to retain their customers.


Build a machine learning model to identify/predict the customers who are likely to churn.

Now that we have a brief idea of what their business is, let’s start gathering the data.

Step 2: Data Gathering

What kind of data will be required to predict customer churn?

We require customer data, a list of services, plans, cost details, etc. We can always check with the client what data they have and request more data as and when required.

To keep things simple, we'll use an open source dataset, Telco Customer Churn, for this blog. It’s a fictional dataset created by IBM and is publicly available.

Download the dataset and save it for further analysis.

Step 3: Data Cleaning

Common data cleaning techniques include correcting typos in the data, removing special characters, converting from one data type to another, and formatting datetimes.
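To make these techniques concrete, here is a minimal sketch on a tiny toy table. The column names and values (`city`, `phone`, `monthly_charges`, `signup_date`) are invented for illustration and are not taken from the Telco dataset; real data will usually need its own rules.

```python
import pandas as pd

# Toy data, invented purely for illustration (not from the Telco dataset)
raw = pd.DataFrame({
    "city": ["Los Angeles", "los angeles ", "Los Angles", "San Diego"],
    "phone": ["(555) 123-4567", "555.987.6543", "555 222 3333", "555-000-1111"],
    "monthly_charges": ["29.85", "56.95", "53.85", "42.30"],
    "signup_date": ["2021-01-15", "2021-02-03", "2021-03-27", "2021-04-09"],
})

clean = raw.copy()

# 1. Correct typos: normalise whitespace and case, then map known misspellings
clean["city"] = clean["city"].str.strip().str.title()
clean["city"] = clean["city"].replace({"Los Angles": "Los Angeles"})

# 2. Remove special characters: keep only the digits of each phone number
clean["phone"] = clean["phone"].str.replace(r"\D", "", regex=True)

# 3. Convert data types: text to float
clean["monthly_charges"] = clean["monthly_charges"].astype(float)

# 4. Datetime formatting: parse the text into real datetime values
clean["signup_date"] = pd.to_datetime(clean["signup_date"], format="%Y-%m-%d")

print(clean)
print(clean.dtypes)
```

Working on a copy (`clean`) keeps the raw table around, which makes it easy to compare before and after.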

We can use either Jupyter Notebook or Google Colab to clean the data, analyze the data, and build and optimize a model.

import pandas as pd # python library to load and clean the dataset
df = pd.read_excel("Customer_Churn.xlsx") # read and load the dataset
pd.set_option('display.max_columns', 50) # this setting displays all columns without any break 

Dataset displayed in a tabular form

df.shape # To check no. of rows and no. of columns in the dataset

Shape of Dataset

First, let’s list out all the columns present in the dataset and then understand each column from the given metadata.

df.columns # displays the names of the columns in the dataset

List of all columns in the dataset

Meta info

df.dtypes # to check the datatype of all columns 

Datatypes of all columns in the dataset
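One thing worth looking for in the `dtypes` output is a text (object) column that actually holds numbers. The sketch below is one possible way to flag such columns. The helper `numeric_looking_columns` and its `threshold` parameter are hypothetical names introduced here, and the sample rows are invented.

```python
import pandas as pd

def numeric_looking_columns(frame, threshold=0.9):
    """Return text columns where at least `threshold` of the non-null values parse as numbers."""
    candidates = []
    for col in frame.columns:
        series = frame[col]
        is_text = (pd.api.types.is_object_dtype(series)
                   or pd.api.types.is_string_dtype(series))
        if not is_text:
            continue  # skip columns that are already numeric, datetime, etc.
        values = series.dropna()
        if values.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising an error
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= threshold:
            candidates.append(col)
    return candidates

# Invented sample: 'Total Charges' is stored as text and has one blank entry
sample = pd.DataFrame({
    "Customer": ["A-1", "B-2", "C-3", "D-4"],
    "Total Charges": ["108.15", "820.5", " ", "3046.05"],
    "Months": [2, 8, 0, 28],
})

print(numeric_looking_columns(sample, threshold=0.7))
```

A threshold below 1.0 is deliberate: a column with a few blanks or stray characters is still a good candidate for numeric conversion.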

Let’s modify the dataset as per the above observations. To do so, let’s first look at a few records of the dataset.

df.head() # displays first five records of the dataset

First 5 records of all columns in the dataset

Modifying the dataset based on observations from previous steps

df.drop(columns=['Lat Long'], inplace=True) # Removing 'Lat Long' column as it is a duplicate
df['Zip Code'] = df['Zip Code'].astype(str) # Converting 'Zip Code' column to object (string) data type
df['Total Charges'] = pd.to_numeric(df['Total Charges'], errors='coerce') # Converting 'Total Charges' column to a numeric (float) datatype; values that cannot be parsed become NaN

Now let’s quickly check whether the modifications were applied.

print("Total Charges: ",df['Total Charges'].dtype,"\nZip Code: ", df['Zip Code'].dtype)

datatype of Total Charges and Zip code changed after modifications
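Because `errors='coerce'` silently turns anything it cannot parse into NaN, it can be useful to see which original values were affected. The following is a small sketch on invented values standing in for the 'Total Charges' column. The blank entry is an assumption for illustration, not a claim about what the real file contains.

```python
import pandas as pd

# Toy values standing in for the 'Total Charges' column; the blank entry is invented
charges = pd.Series(["29.85", "1889.5", " ", "108.15"], name="Total Charges")

converted = pd.to_numeric(charges, errors="coerce")

# Entries that were present as text but became NaN after conversion
failed = charges[converted.isna() & charges.notna()]

print("dtype after conversion:", converted.dtype)
print("values that could not be parsed:", failed.tolist())
print("missing values after conversion:", int(converted.isna().sum()))
```

How to handle the resulting NaN values (drop the rows, fill them, or investigate the source) depends on the dataset, so this sketch only reports them.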

Let’s take a look at a sample of the modified dataset.

df.head() # displays first 5 records of the dataset

Sample of modified dataset

Columns after modifications

From the above output, you can see that the Lat Long column has also been removed.
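Before dropping a column as a duplicate, it can be worth confirming that it really repeats information held elsewhere. The sketch below assumes the combined 'Lat Long' text column repeats two separate numeric columns named `Latitude` and `Longitude`. Those two names and all the values are assumptions made for this example, since the column list is not reproduced in this post.

```python
import pandas as pd

# Toy rows; column names other than 'Lat Long' and all values are assumed
toy = pd.DataFrame({
    "Lat Long": ["33.964131, -118.272783", "34.059281, -118.30742"],
    "Latitude": [33.964131, 34.059281],
    "Longitude": [-118.272783, -118.30742],
})

# Split the combined text column into its two numeric parts
parts = toy["Lat Long"].str.split(",", expand=True)
lat = pd.to_numeric(parts[0].str.strip())
lon = pd.to_numeric(parts[1].str.strip())

# Compare with a small tolerance instead of exact float equality
is_duplicate = bool(
    ((lat - toy["Latitude"]).abs() < 1e-9).all()
    and ((lon - toy["Longitude"]).abs() < 1e-9).all()
)
print("Lat Long duplicates Latitude/Longitude:", is_duplicate)

# Drop the column only once the check has passed
if is_duplicate:
    toy = toy.drop(columns=["Lat Long"])
print(list(toy.columns))
```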

If we are working with a real-time dataset, then at this stage it is recommended to save the cleaned dataset in a cloud database for future use.
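As a rough illustration of persisting cleaned data, the sketch below writes a small DataFrame to a SQL table and reads it back. It uses an in-memory SQLite database from Python's standard library purely as a local stand-in for a cloud database. The table name `customer_churn_cleaned` and the sample rows are made up, and a real cloud database would need its own connection setup.

```python
import sqlite3
import pandas as pd

# Invented rows standing in for the cleaned dataset
cleaned = pd.DataFrame({
    "Zip Code": ["90003", "90005"],
    "Total Charges": [108.15, 151.65],
})

# In-memory SQLite is only a stand-in here; swap in your own database connection
conn = sqlite3.connect(":memory:")
try:
    # Write the cleaned data to a table, replacing it if it already exists
    cleaned.to_sql("customer_churn_cleaned", conn, if_exists="replace", index=False)

    # Read it back to confirm the round trip
    restored = pd.read_sql_query("SELECT * FROM customer_churn_cleaned", conn)
finally:
    conn.close()

print(restored)
```

Storing 'Zip Code' as text means it comes back as text too, so the earlier type conversion does not have to be repeated after loading.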

Now this cleaned data can be used for Exploratory Data Analysis.



One should always remember that the way we define the objective, the way we gather data, and the way we clean/format the data will vary depending on the requirements and the data we have.

That’s it for this blog. Next in this series, we'll see how to perform Exploratory Data Analysis (EDA) on the cleaned dataset, so stay tuned!!

Thanks for reading!!
