In this blog we'll implement the first 3 steps on a customer dataset for Churn Prediction.
By now you should be familiar with all the steps required to build a machine learning model. If you didn’t get a chance to go through them, feel free to check out this blog, where I explained in detail all the steps required to build and optimize an ML model.
Defining the Objective
A telecom company recently noticed that many of their customers were churning unexpectedly. If this continues, the company will incur huge losses. So they immediately worked on a plan to retain their customers.
The company came up with two solutions to solve the issue:
1. Offer special discounts to all their customers in order to retain the customers who might churn. The problem with this solution is that giving huge discounts to all the customers may cause losses, or the company may make very little profit.
2. The company has a huge amount of data about its customers, including the customers who left, so they thought that analyzing their past data may help them predict which customers are likely to churn and whether a special offer can retain them. In hopes of reducing additional losses, the company decided to go with the second solution and approached a data science team to help them analyze the data and predict the customers likely to churn.
So without any further delay, let’s dive into the world of data scientists and see how they approach this problem to predict customer churn.
Step 1: Define the Objective
Understand the business
It’s a telecommunications company that provides services to residents in the USA.
Identify the problem
The company noticed that their customers have been churning for a while. This has impacted their customer base and business revenue, so they need a plan to retain their customers.
Build a machine learning model to predict the customers who are likely to churn.
Now that we have a brief idea of what their business is, let’s start gathering the data.
Step 2: Data Gathering
What kind of data will be required to predict customer churn?
We require customer data, a list of services, plans, cost details, and so on. We can always check with the client about what data they have and request more data as and when required.
Download the dataset from the above link and save it for further analysis.
Step 3: Data Cleaning
Some Data Cleaning techniques are correcting typos in the data, removing special characters, converting from one data type to another, datetime formatting etc.
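To make these techniques concrete, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not taken from the churn dataset; the point is only to show one pandas call per technique.

```python
import pandas as pd

# Hypothetical toy data, not taken from the churn dataset
raw = pd.DataFrame({
    "city": ["Los Angeles", "Los Angles", "San Diego "],
    "monthly_charges": ["$70.50", "$89.10", "$20.00"],
    "signup_date": ["2021-01-15", "2021-02-03", "2021-03-20"],
})

clean = raw.copy()

# Correcting typos: trim stray whitespace, then fix a known misspelling
clean["city"] = clean["city"].str.strip().replace({"Los Angles": "Los Angeles"})

# Removing special characters and converting the data type (text -> float)
clean["monthly_charges"] = (
    clean["monthly_charges"].str.replace("$", "", regex=False).astype(float)
)

# Datetime formatting: parse text dates into proper datetime values
clean["signup_date"] = pd.to_datetime(clean["signup_date"], format="%Y-%m-%d")

print(clean.dtypes)
```

In a real project the list of typos to correct usually comes from inspecting the unique values of each column first.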
We can either use
to clean the data, analyze the data, and build and optimize a model.
Importing necessary libraries for Data Cleaning.
import pandas as pd  # Python library to load and clean the dataset
Loading the dataset
df = pd.read_excel("Customer_Churn.xlsx")  # read and load the dataset
pd.set_option('display.max_columns', 50)  # display up to 50 columns without truncation
Dataset displayed in a tabular form
3.1. Check how many rows and columns (also known as features) are in the dataset
df.shape  # returns (number of rows, number of columns)
Shape of Dataset
3.2. Look for duplicate features by going through the metadata provided.
First, let’s list out all the columns present in the dataset and then understand each column from the given metadata.
df.columns  # displays the names of the columns in the dataset
List of all columns in the dataset
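Reading the metadata is the main way to spot duplicate features, but a quick programmatic check can confirm a suspicion. The sketch below assumes that 'Lat Long' is a text column combining two separate columns named 'Latitude' and 'Longitude'; those two column names and the sample values are assumptions for illustration, so adjust them to whatever the metadata actually lists.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the 'Latitude' and 'Longitude' column names are assumed
sample = pd.DataFrame({
    "Lat Long": ["33.964131, -118.272783", "34.059281, -118.30742"],
    "Latitude": [33.964131, 34.059281],
    "Longitude": [-118.272783, -118.30742],
})

# Split the combined text column into its two halves and convert to floats
parts = sample["Lat Long"].str.split(",", expand=True)
lat = parts[0].map(float)
lon = parts[1].map(float)

# The combined column is redundant if both halves match the separate columns
is_redundant = bool(
    np.isclose(lat, sample["Latitude"]).all()
    and np.isclose(lon, sample["Longitude"]).all()
)
print(is_redundant)
```

If the check passes on the full dataset, dropping the combined column loses no information.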
3.3. Identify Numerical and Categorical features in the gathered data and check if formatting is required or not.
df.dtypes  # to check the datatype of all columns
Datatypes of all columns in the dataset
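One way to split the columns into numerical and categorical groups is `select_dtypes`. The sketch below uses a tiny made-up sample; 'Monthly Charges' is a hypothetical column added for illustration, and the values are invented. It shows the kind of mismatch to look for: a code such as 'Zip Code' read as a number, and an amount such as 'Total Charges' read as text.

```python
import pandas as pd

# Hypothetical sample; values are invented and 'Monthly Charges' is assumed
sample = pd.DataFrame({
    "Zip Code": [90003, 90005],
    "Total Charges": ["108.15", " "],
    "Monthly Charges": [53.85, 70.70],
})

# Columns pandas treats as numbers vs. everything else
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Any column that lands in the wrong group is a candidate for a type conversion.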
Let’s modify the dataset as per the above observations. To do so, let’s first look at a few records of the dataset.
df.head()  # displays first five records of the dataset
First 5 records of all columns in the dataset
Modifying the dataset based on observations from previous steps
df.drop(columns=['Lat Long'], inplace=True)  # Removing 'Lat Long' column as it is a duplicate
df['Zip Code'] = df['Zip Code'].astype(str)  # Converting 'Zip Code' column to object data type
df['Total Charges'] = pd.to_numeric(df['Total Charges'], errors='coerce')  # Converting 'Total Charges' to a numerical (int/float) datatype; unparseable values become NaN
Now let’s quickly check if the modifications are done.
print("Total Charges: ", df['Total Charges'].dtype, "\nZip Code: ", df['Zip Code'].dtype)
datatype of Total Charges and Zip code changed after modifications
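It is worth knowing what `errors='coerce'` does during the 'Total Charges' conversion: any value that cannot be parsed as a number is silently turned into NaN. The sketch below uses made-up values (the blank entry stands in for a non-numeric cell) to show how to count how many values were coerced, so the new missing values don't go unnoticed.

```python
import pandas as pd

# Hypothetical values; the blank entry stands in for a non-numeric cell
charges = pd.Series(["108.15", " ", "820.5"], name="Total Charges")

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(charges, errors="coerce")

# Count the values that were not missing before but are NaN now
n_coerced = int(converted.isna().sum() - charges.isna().sum())

print(converted.dtype)
print("Values coerced to NaN:", n_coerced)
```

Rows that were coerced can then be inspected and handled like any other missing values.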
Let’s take a look at a sample of the modified dataset
df.head()  # displays first 5 records of the dataset
Sample of modified dataset
Columns after modifications
From the above output you can see that the 'Lat Long' column is also removed.
If we are working with a real-time dataset, then at this stage it is recommended to save the cleaned dataset in a cloud database for future use. The cleaned dataset can then be used for Exploratory Data Analysis.
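Storing cleaned data so that later steps can reuse it might look like the sketch below. It uses an in-memory SQLite database purely as a local stand-in, since connecting to a cloud database depends on the provider and its credentials. The table name `customers_clean` and the sample values are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical cleaned sample using column names from this post
cleaned = pd.DataFrame({
    "Zip Code": ["90003", "90005"],
    "Total Charges": [108.15, 1889.5],
})

# In-memory SQLite as a local stand-in for a cloud database
conn = sqlite3.connect(":memory:")

# Write the cleaned data to a table (table name is hypothetical)
cleaned.to_sql("customers_clean", conn, index=False, if_exists="replace")

# Read it back, as a later analysis step would
restored = pd.read_sql("SELECT * FROM customers_clean", conn)
conn.close()

print(restored)
```

With a real cloud database, the same `to_sql` and `read_sql` calls are typically used with a connection created for that database instead of `sqlite3`.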
One should always remember that the way we define the objective, the way we gather data, and the way we clean/format the data will vary depending on the business problem.
That’s it for this blog. Next in this series, we'll see how to perform Exploratory Data Analysis (EDA) on the cleaned dataset, so stay tuned!!
Thanks for reading!!