Data Cleaning - Variance

First published on February 18, 2022

Last updated at March 9, 2022

 

7 minute read

Jahnavi C.

TLDR

In this Mage Academy lesson on data cleaning, we’ll go over variance in detail and see how to identify and remove low variance columns from a dataset.

Glossary

  • Why is it necessary

  • Variance

  • How to code

Why is it necessary?

Figure: Data distribution

We can remove columns from the dataset if they aren’t useful for predicting the output. Low variance columns are one such case: their values barely change from row to row, so they carry little information and contribute little to the prediction. It's therefore recommended to remove the low variance columns.

Variance

Variance measures the spread of data, i.e., it measures how far each data point is from the mean.

Variance is calculated as the average of the squared differences between each data point and the mean.
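For reference, the standard definitions can be written as follows. The symbols are notation chosen here, not taken from the lesson: $x_i$ is a data point, $\bar{x}$ the mean, and $n$ the number of data points. The population variance $\sigma^2$ divides by $n$, while the sample variance $s^2$ divides by $n - 1$ (the default used by pandas' `.var()`).

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
$$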

  • Zero variance indicates that all the values in the column are constant.

  • Low variance indicates that most of the values in the column are similar and are very close to the mean.

  • High variance indicates that values in the column are not similar and are spread far from the mean.
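To make the three cases concrete, here is a minimal sketch using Python's built-in `statistics` module. The three lists are made-up values chosen only for illustration, and `pvariance` computes the population variance (dividing by the number of points).

```python
from statistics import pvariance

# Made-up example columns (not from the lesson's dataset).
constant = [5, 5, 5, 5]        # every value identical
clustered = [49, 50, 50, 51]   # values close to the mean
spread = [10, 40, 60, 90]      # values far from the mean

variances = {
    "constant": pvariance(constant),    # zero variance
    "clustered": pvariance(clustered),  # low variance
    "spread": pvariance(spread),        # high variance
}

for name, value in variances.items():
    print(f"{name}: {value}")
```

Whether a value counts as "low" or "high" is relative: it depends on the scale of the column and on the variances of the other columns.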

Figure: Numerical data

Usually we calculate variance by using a formula. But for categorical data columns we don’t use a formula; instead, we visualize the distributions of the categories with the help of Python visualization libraries such as seaborn and matplotlib.

  • Zero variance indicates that the categories in the column are identical, i.e., every row holds the same category.

  • Low variance indicates that the categories in the column are nearly all the same, i.e., one category dominates.

  • High variance indicates that the categories in the column are not similar and vary from row to row.

Calculate variance

From scratch

Let’s take one column and see how we calculate variance for numerical data.

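Step-1: Calculate the mean

Step-2: Find the difference between each data point and the mean

Step-3: Square the difference values

Step-4: Sum all the squared difference values

Step-5: Calculate the variance by dividing the sum by the number of data points (or by one less than that for the sample variance)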
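The five steps above can be sketched in plain Python as follows. The `scores` list is a made-up column, and since the lesson's original calculation isn't shown here, both the population form (divide by n) and the sample form (divide by n - 1) are computed.

```python
# Made-up scores for one numerical column (illustrative values only).
scores = [70, 80, 90, 60, 100]

# Step-1: calculate the mean.
mean = sum(scores) / len(scores)

# Step-2: difference between each data point and the mean.
differences = [x - mean for x in scores]

# Step-3: square the differences.
squared_differences = [d ** 2 for d in differences]

# Step-4: sum the squared differences.
total = sum(squared_differences)

# Step-5: divide by n (population variance) or by n - 1 (sample variance).
population_variance = total / len(scores)
sample_variance = total / (len(scores) - 1)

print(mean, population_variance, sample_variance)  # 80.0 200.0 250.0
```

The sample form is the one that matches the default behavior of pandas' `.var()`.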

Using pandas library (for numerical data)

Let’s calculate the variance of every column in the dataset that holds numerical data.

Step-1: Load the dataset using Python’s pandas library. We use the read_csv function to read files that have the .csv extension.

Step-2: Calculate the variance of each column using the .var() function

Step-3: Remove columns if variance is low.

The variance of the “history” and “physics” columns is low compared to that of the “english” and “math” columns, so we can remove these two columns from the dataset.
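A minimal sketch of these steps is shown below. The lesson's actual file isn't reproduced here, so the CSV text is a made-up stand-in read from memory; only the column names come from the lesson. The `threshold` value is likewise an assumption chosen for this toy data, not a recommended cutoff.

```python
import io

import pandas as pd

# Made-up CSV contents standing in for the lesson's dataset.
csv_text = """english,math,history,physics
45,30,70,65
90,85,71,66
60,55,69,64
75,95,70,65
"""

# Step-1: read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Step-2: sample variance (ddof=1 by default) of each numerical column.
variances = df.var(numeric_only=True)
print(variances)

# Step-3: flag columns whose variance falls below an assumed threshold.
threshold = 10.0
low_variance_cols = variances[variances < threshold].index.tolist()
print(low_variance_cols)  # ['history', 'physics']
```

Comparing variances against a single threshold is only meaningful when the columns share a comparable scale, as exam scores do.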

Using pandas library (for categorical data)

Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.

Step-2: Plot the distribution of categorical columns using Python’s seaborn library.

We’ll use the countplot() function to visualize the distribution.

Step-3: Drop columns if variance is low.

The variance of the “school” and “pass” columns is low, so we can remove these columns from the dataset.
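As a rough sketch of the plotting step, the snippet below draws one count plot per categorical column. The rows are made up, and the "gender" column is a hypothetical addition included only as a contrast to the lopsided "school" and "pass" columns. The Agg backend is selected so the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows; "gender" is a hypothetical column added for contrast.
df = pd.DataFrame({
    "school": ["A"] * 9 + ["B"],
    "pass": ["yes"] * 9 + ["no"],
    "gender": ["F", "M"] * 5,
})

# One count plot per categorical column, side by side.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, column in zip(axes, df.columns):
    sns.countplot(data=df, x=column, ax=ax)
    ax.set_title(column)
fig.tight_layout()

# Call plt.show() or fig.savefig("counts.png") to view the figure.
```

In this toy data, one bar towers over the other for "school" and "pass" (low variance), while the bars for "gender" are the same height.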

How to code

Using Pandas library: 

We’ve seen that the “history,” “physics,” “school” and “pass” columns have low variance. So, we’ll use the .drop() method to remove these columns.
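A minimal sketch of the removal step, assuming a DataFrame with the six columns named in this lesson; the row values are made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names come from the lesson.
df = pd.DataFrame({
    "english": [45, 90, 60],
    "math": [30, 85, 55],
    "history": [70, 71, 69],
    "physics": [65, 66, 64],
    "school": ["A", "A", "A"],
    "pass": ["yes", "yes", "no"],
})

low_variance_columns = ["history", "physics", "school", "pass"]

# drop() returns a new DataFrame; the original df is left unchanged.
df_clean = df.drop(columns=low_variance_columns)
print(df_clean.columns.tolist())  # ['english', 'math']
```

Assigning the result to a new name keeps the original data available in case a dropped column turns out to be needed later.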

Magical no code solution

When you're building models with Mage, it’s easy to remove columns. 

Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮