Feature Engineering - Count aggregate

First published on February 24, 2022

Last updated at April 22, 2022

 

5 minute read

Felicia Kuan

Growth

TLDR

In this Mage Academy lesson on feature engineering, we’ll learn how we’d count the number of distinct values in a column by group.

Glossary

  • How it works

  • Example code

  • Magical no-code solution ✨🔮

How it works

In large datasets, it’s useful to utilize at-a-glance metrics like the amount of unique values in a column to gain familiarity with the kind of data we will be working with. 

Cropped Mr. Sun menu

To better explain what an aggregate function is, I’ll explain by example. If we owned this Mr. Sun boba shop, at the end of the day, we can use 

group by

to group all the boba sold today into the categories detailed in the menu above. Then, we can use the aggregate function, 

count unique values

, to tally the number of unique drinks sold per category. 

With this analysis, would we see that every drink had been sold at least once today? If we see that we didn’t sell even one Mango, Strawberry, and Honey Kiwi smoothie for more than a couple days, should we consider making these drinks seasonal, so we don’t need to supply our store with fresh fruits during the winter months? 

Thus, count unique values, as the name suggests, is an aggregate function that inspects a column and returns a count of the number of different values per group. 

Example code

In this section, I’ll detail the steps to group the boba drinks by a unique value in the column “Category”. Once we have the groups, we’ll total the amount of unique drinks that were sold in each category to obtain the count of unique values.

1
2
3
4
import pandas as pd

df = pd.read_csv('boba-history.csv')
df

Dictionaries are Python’s most straightforward way of storing and counting unique string values. Thus, we’ll use a dictionary key to keep track of the unique category names, like “Mr. Sun Specialties” or “Milk Tea.” We’ll iterate through our dataset, saving the drink names as list values with the group that the drink is categorized as. For example, under the key– or group– Milk Tea, the drinks Earl Grey and Angel Caramel are listed as values:

{ ‘Milk Tea’: [‘Royal Earl Grey Milk Tea, ‘Angel Caramel Milk Tea’]}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
categories = {}

for _, row in df.iterrows():

  key = row['Category']
  if key not in categories:
    # save drink name to category
    categories[key] = [row['Drink']]
  else:
    # check if drink name is already added
    if row['Drink'] not in categories[key]:
      categories[key].append(row['Drink'])

categories

Lastly, to get the number of unique drinks in each category, we can use the Python 

len()

function since all the drinks for each category are stored in the corresponding list.

And here are the results! We have 1 unique drink ordered from the “Fresh Tea” category, 2 from “Milk Foam,” etc. Looking at the extensive menu Mr. Sun has, if only 1 or 2 drinks are ordered from each category of our menu each day, maybe we can consider reducing our menu so we can minimize the storage space of all the raw ingredients used to make the less popular drinks. 

For our next example, I’d like to pose a different question from the previous section. Let’s say I want to open a new Mr. Sun branch in a different Bay Area city. It would be useful to get a count of how many competitors I have per city!

Using these two Pandas built-in aggregate functions, 

groupby()

and 

count()

, we’ll be analyzing this Bay Area boba spots 

dataset

to find how many boba shops (count) are in each city (group by). 

1
2
3
4
import pandas as pd

df = pd.read_csv('bayarea_boba_spots.csv')
df

Then, we’d use groupby() a built-in Pandas function that returns an object similar to the dictionary we created in the “From scratch” section. It’s messy to view the object, but I wrote a helper function “

print_obj

” that prints the unique list of cities in the Bay that have at least one boba shop:

Now that we have the groups, we can use .count() to count the number of unique boba shops by city.

1
2
gp = df.groupby(['city'])['name'].count()
gp

Thus, the complete code needed to perform our tally of boba shops per city is:

Magical no-code solution ✨🔮

Additionally, our product, Mage, has functions that can help you find the distinct count in of a feature based on the groupings of another.

After uploading your data, you can find the no-code function using Edit data > Aggregate. Then, select the column you’d like to group by from the drop-down menu (“city” for our example). Make sure you select “distinct count” to count the unique values. Finally, select the column you’re counting unique values in (I selected “name” of the boba shop”). 

Want to learn more about machine learning (ML)? Visit 

Mage Academy

! ✨🔮