Feature Engineering - Average aggregate

First published on February 22, 2022

Last updated at April 19, 2022

5 minute read

Felicia Kuan

TLDR

In this Mage Academy lesson on feature engineering, we’ll be covering how we’d group rows by values in a feature to find the average of each group.

Glossary

How it works
Example code
Magical no-code solution ✨🔮

How it works

First, I’d like to propose a scenario to help us better understand what obtaining the average aggregate entails. Let’s say I want to move to a new city in the Bay Area with amazing boba. Therefore, I would need to determine which cities have the highest average rating at their boba shops.

Map of boba stores from Lemonhearted

To solve this problem, we’d first need to group each boba shop by city.

Group by

, a function that groups the rows by a distinct value in a column, does this job perfectly. Then, used in conjunction with group by, will be

average aggregate

, which finds the average values of a column for each group. Aggregate functions create a new column of data from existing features, so for our scenario, we will apply aggregate to find the average shop rating for each city.

Example code

To implement the analysis described above, we’ll perform the group by and average aggregate operations on this comprehensive

of boba shops in the Bay Area to find the average rating of each city.

From scratch

The first step is to group boba shops by city, so we’ll store as a list of tuples inside a dictionary. To do this, we’ll iterate through the rows of data to save the 69 unique city names as keys to the dictionary.

Since the columns of importance for our analysis are only the ‘name’ of the boba shop, the ‘city’ it’s located in, and the shop’s Yelp ‘rating,’ I’ll only save these details in the dictionary.

Thus, below is an example output of this dictionary for some context on what our code is trying to achieve:

'Milpitas': [('Tea Villa', 4.0), ('i-Tea', 4.0), …]

From this, we can see that one of the keys of the dictionary, “cities,” is Milpitas. The values would be a list of tuples that records the name of the boba shop with the Yelp rating.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
cities = {}

for _, row in df.iterrows():

  key = row['city']
  if key not in cities:
    # save shop name + rating as tuple
    cities[key] = [(row['name'], row['rating'])]

  else:
     # add additional shops to city
     if row['name'] not in cities[key]:
       cities[key].append( (row['name'], row['rating']) )

cities

Now that we have all the shops grouped by city, our next order of business is to take the average rating for the 69 groups. Below is the calculation process for finding the average. Though it’s not required, the “optional” section further details the process and helps us check our work.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
agg_averages = {}

for key in cities:
   rating_sum = 0
   # for each city, sum the rating
   for shop in cities[key]:
      rating_sum += shop[1] # in the tuple: shop (name, rating)

   average = float(rating_sum)/float(len(cities[key]))

   # optional print:
   print("Average for {}: {} / {} = {}"
         .format(key, rating_sum, len(cities[key]), average))

   agg_averages[key] = average

list(agg_averages.items())

Output for calculating averages (first 8 lines):

Output for agg_averages (first 8 lines):

From the output, we can see that the calculations of rating: total star count divided by the review number. With this data, we can now have a list of top cities to move to that have the best boba!

Using Pandas

While the previous section showed you how to find the average of each group from scratch, this section details a set of powerful Pandas built-in functions that helps you achieve the same thing in one line.

Like before, we will use the

groupby()

function to group the rows of data by city. Because Pandas’

function must always be accompanied by an aggregate function, we directly pass the output of groupby to the aggregate function

mean()

. The following code finds the averages of the ratings per group:

1
2
gp = df.groupby(['city'])['rating'].mean()
pd.DataFrame(gp)

The averages found using the built-in functions match the averages we found from scratch, which cross-checks our values.

To double check, here are screenshots of the first and last five averages calculated in the earlier section.

I found these ten averages for the cities’ match, but if you want to check even further, make sure you try out the code snippets and iterate through all 69 cities to compare averages!

For your convenience, we’ve provided the full code snippet for trying

average aggregate

with Pandas:

1
2
3
4
5
import pandas as pd

df = pd.read_csv('bayarea_boba_spots.csv')
gp_by_city = df.groupby(['city'])['rating'].mean()
pd.DataFrame(gp_by_city)

Again, we solved our problem and found the average rating of boba shops in each city!

Magical no-code solution ✨🔮

On those days where we cannot be bothered to write even one line of code, try our service, Mage! Aggregate functions create a new column of data from existing features, so watch how the new column “average” is created from the Yelp “rating” and “city” columns in this boba dataset:

To find the average value of rating by groups of cities, go to Edit data > Aggregate. Enter “city” for the column that we’ll be grouping by. Then, select “average” for the type of aggregate function, and select “rating” for the column we’ll be performing the calculation on.

Want to learn more about machine learning (ML)? Visit

! ✨🔮

GET STARTED

COMMUNITY

LEARN

Magical powers for data teams

GET STARTED

COMMUNITY

LEARN

2024 Mage Technologies, Inc.