Data Cleaning - Remove duplicates

First published on February 9, 2022

Last updated on April 19, 2022

3 minute read

Felicia Kuan

TLDR

In this Mage Academy lesson on data cleaning, we’ll learn how to use Pandas to remove rows that repeat a value in a given column.

Outline

  • When’s it necessary?

  • How to code

  • Magical no-code solution 🪄

When’s it necessary?

Duplicate data can skew prediction results, because a record that appears more than once carries extra weight.

Thus, for columns that should contain unique values, it’s important to find and remove any duplicate rows to achieve a more general and accurate prediction.
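Before removing anything, it helps to check whether a column actually contains repeats. Here is a minimal sketch using Pandas’ duplicated() method on a small made-up DataFrame (the rows and values are illustrative assumptions, not real data):

```python
import pandas as pd

# Hypothetical toy data: "Charizard" appears three times
df = pd.DataFrame({
    "Name": ["Charmander", "Charizard", "Charizard", "Charizard", "Squirtle"],
    "Attack": [52, 84, 130, 104, 48],
})

# True for every row whose "Name" also appears in another row
is_repeated = df.duplicated(subset=["Name"], keep=False)

# Number of rows that would be removed if only first occurrences were kept
n_extra = int(df.duplicated(subset=["Name"]).sum())

print(df[is_repeated])
print(n_extra)
```

If n_extra is zero, the column is already unique and there is nothing to clean.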

How to code

Take Kaggle’s Pokemon dataset, for example: the extra rows containing “Mega” versions of Pokemon aren’t needed to analyze the entire Pokemon index, since Megas are simply beefier copies of the same Pokemon.

Thiagoazen’s Pokemon dataset, ft. 3 Charizards

From scratch

While a built-in function (see next section) gets the job done, we will also present an algorithm that filters unique values of a column using a dictionary, just in case it shows up in an assignment or exam. 😉

By looking at the first ten rows of data, we can see several duplicates in the “Name” column that we need to remove (like Venusaur).

import pandas as pd

data = pd.read_csv("PokemonDb.csv")
data.head(10)  # show the first ten rows

To remove them, we store only the first occurrence of each Pokemon’s name in a dictionary. As we go through the rows one by one (using iterrows()), we check whether the name is already in the dictionary and drop the row if it is.

unique_names = {}

# Keep only the first occurrence of each name
for i, row in data.iterrows():
    if row["Name"] in unique_names:
        # Name already seen: drop this duplicate row
        data.drop(i, inplace=True)
    else:
        # First time we see this name: remember it
        unique_names[row["Name"]] = True

data

The complete code if you’d like to try it yourself:
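As a self-contained sketch, the dictionary approach can be run on a small hand-made DataFrame standing in for PokemonDb.csv. The rows, the "Total" column, and the helper name drop_duplicate_names are illustrative assumptions, not taken from the dataset. This variant collects the duplicate row labels first and drops them in a single call, rather than dropping in place during the loop:

```python
import pandas as pd


def drop_duplicate_names(data, column="Name"):
    """Return a copy of `data` keeping only the first row for each value in `column`."""
    seen = {}
    rows_to_drop = []
    for i, row in data.iterrows():
        if row[column] in seen:
            rows_to_drop.append(i)    # already seen: mark row as a duplicate
        else:
            seen[row[column]] = True  # first occurrence: remember it
    return data.drop(rows_to_drop)


# Hypothetical stand-in for PokemonDb.csv
data = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Venusaur", "Charmander"],
    "Total": [318, 405, 525, 625, 309],
})

deduped = drop_duplicate_names(data)
print(deduped)
```

Because the helper returns a new DataFrame, the original data is left untouched, which makes it easy to compare the two side by side.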

Built-in Pandas Function

The promised built-in function, drop_duplicates(), deletes rows based on duplicates in the list of column name(s) that you specify in the subset parameter.
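A short sketch of the call, again on a made-up DataFrame (the rows and the "Total" column are assumptions for illustration). The keep parameter controls which of the repeated rows survives:

```python
import pandas as pd

# Hypothetical toy data with repeated names
data = pd.DataFrame({
    "Name": ["Bulbasaur", "Venusaur", "Venusaur", "Charizard", "Charizard", "Charizard"],
    "Total": [318, 525, 625, 534, 634, 634],
})

# Keep the first row for each name (the default, keep="first")
first_only = data.drop_duplicates(subset=["Name"])

# Keep the last row for each name instead
last_only = data.drop_duplicates(subset=["Name"], keep="last")

print(first_only)
print(last_only)
```

By default drop_duplicates() returns a new DataFrame and leaves data unchanged, so assign the result to a variable to keep it.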

Magical no-code solution 🪄

Last, but definitely not least, Mage has a row transformation action that removes duplicates from your dataset! Try this if you’d like to leverage AI without learning the ins and outs of Pandas.

Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮