Guide to Churn Prediction: Part 5 - Graphical analysis

First published on February 9, 2022

 

10 minute read

Jahnavi C.

TLDR

In this blog, we’ll explore discrete and categorical features in the Telco Customer Churn dataset using univariate graphical methods.

Outline

  • Recap

  • Before we begin

  • Univariate graphical analysis

  • Conclusion

Recap

In part 4 of the series, we analyzed and explored continuous data features in the Telco Customer Churn dataset using graphical methods.

Before we begin

This guide assumes that you are familiar with data types. If you’re unfamiliar, please read up on data types before continuing.

Statistical concepts

Let’s go over a couple of statistical concepts.

Balanced data

Balanced

Data is said to be balanced if the number of records in each category is equal or nearly equal.

Imbalanced data

Imbalanced: Image by Mediamodifer from Pixabay

Data is said to be imbalanced if the number of records in one category is much greater than the number of records in the other categories.

Note: If the target feature has categorical data, we’ll look at how the data is distributed across all of the categories and check whether the feature has balanced or imbalanced data.
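One way to check this in practice is to compare the share of records that falls into each category. The following is a minimal sketch on a made-up target column. The `class_shares` helper and the toy values are hypothetical and are not part of the original analysis.

```python
import pandas as pd


def class_shares(target: pd.Series) -> pd.Series:
    """Return the fraction of records in each category, largest first."""
    return target.value_counts(normalize=True)


# Toy target column (hypothetical values, not the Telco data)
toy_target = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No"])

shares = class_shares(toy_target)
print(shares)
```

Shares that are close to each other suggest balanced data, while one dominant category suggests imbalance. How large the gap must be before the data counts as imbalanced is a judgment call.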

Univariate graphical analysis

The main purpose of univariate graphical analysis is to understand the distribution patterns of features. To visualize these distributions, we’ll utilize Python libraries like matplotlib and seaborn. These libraries contain a variety of graphical methods (such as bar plots, count plots, KDE plots, violin plots, etc.) that help us visualize distributions in different styles.

Now, let’s perform univariate graphical analysis on discrete and categorical data features.

Import libraries and load dataset

Let’s start by importing the necessary libraries and loading the cleaned dataset. Check out the earlier post on data cleaning to see how we cleaned the dataset.

import pandas as pd
import matplotlib.pyplot as plt  # Python library to plot graphs
import seaborn as sns  # Python library to plot graphs

# Displays graphs inline in the Jupyter notebook.
# The comment sits on its own line because text after a line magic may be parsed as part of its arguments.
%matplotlib inline

df = pd.read_csv('cleaned_dataset.csv')
df  # displays the dataset

Cleaned dataset

Identify discrete and categorical features

Discrete features are of int data type, while categorical features are of object data type.

Note: Sometimes categorical data is represented in the form of numbers. So if a feature is of int data type but takes only a few distinct values that stand for categories (1, 2, 3, 4, 5 or 0 and 1, etc.), then it’s a categorical feature; otherwise, it’s a discrete feature.

So let’s check the data types of the features using the dtypes attribute and identify discrete and categorical features.

df.dtypes

Data types of features

Observations:

  1. “Country,” “State,” “City,” “Zip Code,” “Gender,” “Senior Citizen,” “Partner,” “Dependents,” “Phone Service,” “Multiple Lines,” “Internet Service,” “Online Security,” “Online Backup,” “Device Protection,” “Tech Support,” “Streaming TV,” “Streaming Movies,” “Contract,” “Paperless Billing,” “Payment Method,” “Churn Label,” and “Churn Reason” features are of object data type, so these are categorical features.

  2. “Count,” “Tenure Months,” “Churn Value,” “Churn Score,” and “CLTV” features are of the int data type. So let’s look at the values in these features and decide if they’re discrete or categorical features.

Display the int data type features using the select_dtypes() method.

df.select_dtypes(int)

Features of int data type

Observations:

  1. The “Count” and “Churn Value” features contain only 1s and 0s, so these are categorical features.

  2. “Tenure Months,” “Churn Score,” and “CLTV” are discrete features.
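Instead of inspecting the values by eye, the same decision can be approximated by counting the distinct values in each int column. This is only a sketch on a toy frame. The column values and the `max_categories` cut-off are assumptions made for illustration, and a different cut-off may suit other datasets.

```python
import pandas as pd

# Toy frame with hypothetical values (not the Telco data)
toy = pd.DataFrame({
    "Count": [1, 1, 1, 1, 1, 1],
    "Churn Value": [0, 1, 0, 0, 1, 0],
    "Tenure Months": [1, 72, 5, 24, 60, 13],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# Keep only the integer columns, then count the distinct values in each
int_cols = toy.select_dtypes(include=int)
unique_counts = int_cols.nunique()

# Assumed cut-off: integer columns with at most this many distinct values
# are treated as categorical, the rest as discrete
max_categories = 2
categorical_like = unique_counts[unique_counts <= max_categories].index.tolist()
discrete = [col for col in int_cols.columns if col not in categorical_like]

print(unique_counts)
print("Categorical:", categorical_like)
print("Discrete:", discrete)
```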

Create new datasets

Based on the type of data, separate the features and create 2 new datasets.

Create a dataset df_disc that contains all the discrete features and display the first 5 records using the head() method.

df_disc = df[['Tenure Months','Churn Score','CLTV']]
df_disc.head()

Discrete features

Create a dataset df_cat that contains all the categorical features and display the first 5 records using the head() method.

df_cat = df[['Country','State','City','Zip Code','Count','Gender','Senior Citizen',
             'Partner','Dependents','Phone Service','Multiple Lines','Internet Service',
             'Online Security','Online Backup','Device Protection','Tech Support','Streaming TV',
             'Streaming Movies','Contract','Paperless Billing','Payment Method',
             'Churn Label','Churn Value','Churn Reason']]

df_cat.head()

Categorical features

Distribution plots

We visualize the distributions of discrete and categorical features using graphical methods like count plots, bar plots, pie charts, etc.

Count plots: These plots are graphical representations of the number of records in each category of a feature. Each bar represents a unique value or a category, and the length of each bar represents the number of records with that value.
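The numbers behind a count plot are the per-category record counts, which pandas can compute with `value_counts()`. Here is a small sketch on a made-up column. The values are hypothetical and do not come from the Telco data.

```python
import pandas as pd

# Toy categorical column (hypothetical values, not the Telco data)
contract = pd.Series(
    ["Month-to-month", "One year", "Month-to-month",
     "Two year", "Month-to-month", "One year"],
    name="Contract",
)

# One entry per category; each value is the bar length a count plot would draw
counts = contract.value_counts()
print(counts)
```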

Discrete data plots

fig = plt.figure(figsize=(14, 8))  # sets the size of the whole figure: width 14, height 8
for i, column in enumerate(df_disc.columns):
    ax = plt.subplot(2, 2, i + 1)  # places each plot on a 2 x 2 grid (up to 4 subplots)
    sns.countplot(data=df_disc, x=column, ax=ax)  # creates a count plot for each feature in df_disc
    ax.set_xlabel(None)  # removes the label on the x-axis
    ax.set_title(f'Distribution of {column}')  # adds a title to each subplot
plt.tight_layout(w_pad=3)  # adds padding between the subplots, once all of them are drawn
plt.show()  # displays the plots

Count plots of discrete features

Let’s take a closer look at the “Tenure Months” plot.

“Tenure Months” count plot

Observations:

Approximately 600 customers have been with the company for one month, and nearly 400 customers have been with the company for 72 months.
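A single feature is easier to read when it gets a wide figure of its own. The sketch below shows one possible way to draw such a close-up. It uses a small made-up frame named `toy_disc` so that it runs on its own, and the figure size is an assumption. With the real data, `df_disc` would take the place of `toy_disc`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy tenure values (hypothetical, not the Telco data)
toy_disc = pd.DataFrame({"Tenure Months": [1, 1, 1, 2, 5, 5, 24, 60, 72, 72]})

# One wide figure for a single feature keeps every bar and tick label readable
fig, ax = plt.subplots(figsize=(18, 6))
sns.countplot(data=toy_disc, x="Tenure Months", ax=ax)
ax.set_title("Distribution of Tenure Months")
plt.show()
```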

Categorical data plots

fig = plt.figure(figsize=(14, 22))  # sets the size of the whole figure: width 14, height 22
for i, column in enumerate(df_cat.columns[4:-2]):  # features from 'Count' to 'Churn Label'
    ax = plt.subplot(7, 3, i + 1)  # grid with 7 rows and 3 columns, can hold up to 7 * 3 = 21 subplots
    sns.countplot(data=df_cat, x=column, ax=ax)  # creates a count plot for each selected feature in df_cat
    ax.set_xlabel(None)  # removes the label on the x-axis
    ax.set_title(f'Distribution of {column}')  # adds a title to each subplot
    plt.xticks(rotation=25)  # rotates the x-axis tick labels of the current subplot by 25 degrees
plt.tight_layout(w_pad=3)  # adds padding between the subplots, once all of them are drawn
plt.show()  # displays the plots

Count plots of categorical features
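The grid size for a set of subplots can also be derived from the number of features instead of being chosen by hand. The helper below, `grid_shape`, is a hypothetical name introduced only for this sketch.

```python
import math


def grid_shape(n_plots: int, n_cols: int = 3) -> tuple:
    """Return (rows, cols) of the smallest grid with n_cols columns that fits n_plots."""
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


# df_cat has 24 columns, so df_cat.columns[4:-2] selects 18 of them
# ("Count" through "Churn Label")
shape = grid_shape(18)
print(shape)  # (6, 3)
```

A grid with more rows than needed, such as 7 x 3, still works; the unused cells are simply left empty.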

Observations:

The company provides various services to its customers, such as phone, internet, and multiple telephone lines, along with additional services like online security, online backup, and device protection plans.

Now, let’s take a closer look at all the plots.

Observations:

  1. All the values in the “Count” column are identical.

  2. The numbers of male and female customers are nearly equal, and the majority of customers are non-senior.

  3. The majority of the customers are either single or don’t have any dependents.

Observations:

  1. Most of the customers have a phone service subscription, and nearly half of them have multiple telephone lines.

  2. The majority of customers use the company’s internet services. Fiber optic is the most popular internet connection among them.

Observations:

  1. Customers can subscribe to additional services such as online security and backup, but just a small percentage of customers have taken advantage of these.

  2. The majority of customers are on a month-to-month contract.

Now, let’s take a look at the distribution of categories in the target feature “Churn Label” and see if the data is balanced or imbalanced.

Yes represents churned customers, while No represents non-churned customers.

Observations:

Compared to the number of non-churned customers (~5000), the number of churned customers is quite low (~1900), i.e. the data is not evenly distributed among the categories. This indicates that the data is imbalanced.
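To put the gap into numbers, the rounded counts quoted above can be turned into a share and a ratio. This is plain arithmetic on the approximate values read off the plot, not on exact counts from the dataset.

```python
# Approximate counts read off the "Churn Label" plot (rounded values from the text)
approx_counts = {"No": 5000, "Yes": 1900}

total = sum(approx_counts.values())
churn_share = approx_counts["Yes"] / total
imbalance_ratio = approx_counts["No"] / approx_counts["Yes"]

print(f"Churned share: {churn_share:.1%}")
print(f"Non-churned to churned ratio: {imbalance_ratio:.2f}")
```

With these rounded counts, a little over a quarter of the customers churned, and there are roughly 2.6 non-churned customers for every churned one.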

Conclusion

As seen, univariate graphical analysis is the simplest way of analyzing data. This analysis helps us comprehend the data better.


That’s it for this blog. Next in the series, we’ll perform multivariate graphical analysis and find reasons for customer churn.

Thanks for reading!!