Data Cleaning - Variance

First published on February 18, 2022

Last updated on March 9, 2022

7 minute read

Jahnavi C.

Growth

TLDR

In this Mage Academy lesson on data cleaning, we’ll go over variance in detail and see how to identify and remove low variance columns from a dataset.

Glossary

  • Why is it necessary

  • Variance

  • How to code

Why is it necessary?

Data distribution

We can remove a column from a dataset if it isn’t useful for predicting the output. Low variance columns fall into this group: their values barely change from row to row, so they carry little information for the model. It's therefore recommended to remove them.

Variance

Variance measures the spread of data, i.e., how far the data points lie from the mean on average. More precisely, it's the average of the squared distances from the mean.

Mathematically, variance is calculated using the following formula:
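In standard notation, and assuming the sample variance (the version that pandas' .var() computes by default), the formula can be written as follows. The symbols are our own labels: n is the number of data points, x_i is the i-th data point, and x̄ is the mean of the column.

```latex
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The population variance uses the same sum but divides by n instead of n - 1.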

  • Zero variance indicates that all the values in the column are constant.

  • Low variance indicates that most of the values in the column are similar and are very close to the mean.

  • High variance indicates that values in the column are not similar and are spread far from the mean.

Numerical data

For numerical columns, we calculate variance using a formula. For categorical columns we don’t use a formula; instead, we visualize the distribution of the categories with the help of Python visualization libraries such as seaborn and matplotlib.

  • Zero variance indicates that every value in the column belongs to the same category.

  • Low variance indicates that nearly all the values in the column fall into a single category.

  • High variance indicates that the values in the column are spread across several categories.

Let’s take one column and see how we calculate variance for numerical data.

Step-1: Calculate mean

Step-2: Find the difference between each data point and mean

Step-3: Square the difference values

Step-4: Sum all the squared difference values

Step-5: Calculate variance
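The five steps above can be sketched in plain Python. This is a minimal sketch: the scores are made-up values for a single hypothetical column, and Step-5 assumes the sample variance (dividing by n - 1), with the population version shown next to it for comparison.

```python
# Hypothetical scores for a single numerical column.
scores = [70, 80, 90, 100]

# Step-1: calculate the mean.
mean = sum(scores) / len(scores)

# Step-2: find the difference between each data point and the mean.
differences = [x - mean for x in scores]

# Step-3: square the difference values.
squared = [d ** 2 for d in differences]

# Step-4: sum all the squared difference values.
total = sum(squared)

# Step-5: calculate variance.
# Dividing by n - 1 gives the sample variance; dividing by n gives the
# population variance.
sample_variance = total / (len(scores) - 1)
population_variance = total / len(scores)

print(mean, total, sample_variance, population_variance)
```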

Let’s calculate variance for all the columns in the dataset that have numerical data.

Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.

Step-2: Calculate the variance of each column using the .var() method.
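Steps 1 and 2 might look like the sketch below. The lesson's CSV file isn't reproduced here, so the data is a made-up stand-in that only borrows the column names from this lesson, and it is read from an in-memory string instead of a file on disk. With a real file, you would pass its path to read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the lesson's CSV file; all values are made up.
csv_text = """english,math,history,physics,school,pass
55,40,80,78,A,yes
90,95,81,79,A,yes
70,60,80,80,A,yes
35,85,82,79,B,yes
80,20,81,78,A,no
"""

# Step-1: read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Step-2: numeric_only=True skips the text columns ("school" and "pass").
# By default .var() returns the sample variance (ddof=1).
variances = df.var(numeric_only=True)
print(variances)
```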

Step-3: Remove columns if variance is low.

The variance of the “history” and “physics” columns is low compared to that of the “english” and “math” columns, so we can remove these two columns from the dataset.
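Instead of comparing the numbers by eye, we could also pick the low variance columns with a cutoff. The sketch below is one possible way to do it, not the lesson's own code: the data is made up and the threshold of 1.0 is an arbitrary, hypothetical value that you'd tune for your own dataset.

```python
import pandas as pd

# Hypothetical scores; only the column names come from the lesson.
df = pd.DataFrame({
    "english": [55, 90, 70, 35, 80],
    "math": [40, 95, 60, 85, 20],
    "history": [80, 81, 80, 82, 81],
    "physics": [78, 79, 80, 79, 78],
})

variances = df.var()

# Hypothetical cutoff: columns with variance below it count as "low".
threshold = 1.0
low_variance_cols = variances[variances < threshold].index.tolist()

# Remove the selected columns; df itself is left unchanged.
df_reduced = df.drop(columns=low_variance_cols)
print(low_variance_cols)
print(df_reduced.columns.tolist())
```

Keep in mind that variance depends on the scale of the values, so a single cutoff is only meaningful when the columns are measured on a similar scale, as exam scores are here.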

Now let’s look at the columns in the dataset that have categorical data.

Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.

Step-2: Plot the distribution of the categorical columns using Python’s seaborn library. We’ll use the countplot() function to visualize the distribution.
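A minimal sketch of the plotting step is shown below. The values are made up, the output file name is hypothetical, and the non-interactive "Agg" backend is only selected so that the script also runs without a display. In a notebook you can leave that line out.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; optional in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical data; only the column names come from the lesson.
df = pd.DataFrame({
    "school": ["A", "A", "A", "B", "A"],
    "pass": ["yes", "yes", "yes", "yes", "no"],
})

# One bar chart per categorical column: each bar is the count of a category.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, column in zip(axes, ["school", "pass"]):
    sns.countplot(data=df, x=column, ax=ax)
    ax.set_title(column)

fig.tight_layout()
fig.savefig("categorical_counts.png")  # hypothetical output file name
```

A column where one bar towers over the others is a low variance column.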

Step-3: Drop columns if variance is low.

The variance of the “school” and “pass” columns is low, so we can remove these columns from the dataset.
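If you'd like a number to go with the plot, one option (our own addition, not part of the lesson's plotting approach) is to check what share of the rows belongs to the most frequent category. In this sketch the data is made up, the “gender” column is a hypothetical extra column added for contrast, and the cutoff of 0.85 is an arbitrary value.

```python
import pandas as pd

# Hypothetical categorical data with 20 rows.
df = pd.DataFrame({
    "school": ["A"] * 19 + ["B"],
    "pass": ["yes"] * 18 + ["no"] * 2,
    "gender": ["F", "M"] * 10,  # hypothetical column, added for contrast
})

# Share of rows taken by the most frequent category in each column.
dominant_share = {
    col: df[col].value_counts(normalize=True).max()
    for col in df.columns
}

# Hypothetical cutoff: a column counts as low variance when a single
# category covers at least this share of the rows.
cutoff = 0.85
low_variance_cols = [col for col, share in dominant_share.items() if share >= cutoff]
print(dominant_share)
print(low_variance_cols)
```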

How to code

We’ve seen that the “history,” “physics,” “school” and “pass” columns have low variance. So, we’ll use the .drop() method to remove these columns.
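Here's a minimal sketch of that call. The DataFrame is a small made-up example that reuses the column names from this lesson; on your own data you'd call .drop() on the DataFrame you loaded with read_csv().

```python
import pandas as pd

# Hypothetical data; only the column names come from the lesson.
df = pd.DataFrame({
    "english": [55, 90, 70],
    "math": [40, 95, 60],
    "history": [80, 81, 80],
    "physics": [78, 79, 80],
    "school": ["A", "A", "A"],
    "pass": ["yes", "yes", "yes"],
})

low_variance_cols = ["history", "physics", "school", "pass"]

# drop() returns a new DataFrame; the original df is left unchanged.
df_clean = df.drop(columns=low_variance_cols)
print(df_clean.columns.tolist())
```

Because .drop() returns a new DataFrame, assign the result to a variable (or back to df) to keep the change.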

When you're building models with Mage, it’s easy to remove columns. 

Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮

Start building for free

No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.
