Feature Engineering - Total sum

First published on February 16, 2022

Last updated at April 19, 2022

 

7 minute read

Felicia Kuan

TLDR

In this Mage Academy lesson on feature engineering, we’ll learn how we’d aggregate, or sum up, values by group to find total items sold.

Glossary

  • How it works

  • Example code

  • Magical no-code solution ✨🔮

How it works

Since the concepts, aggregate, group by, and total sum, are a little tech jargon-y, this section defines these concepts by example to help those who are new to these functions.

Cropped Mr. Sun menu

Imagine you own this shop, and you’re looking through today’s sales 

at the end of the day. Each drink sold is logged with a name, price, and category.

Group by

is a function that groups the rows by a distinct value in a column. Let’s say that Mage made a group order of 13 bobas one Thursday afternoon. Instead of throwing them all into one bag, the manager placed all 7 drinks from the “Milk Tea” category in one bag, 2 of the “Mr. Sun Specialties” in another, etc to see, at a glance, which category is most popular with Mage employees. Thus, group by is a function that groups rows that have matching values in a certain column; in our boba example, we grouped by menu category, but we can group by any column. 

Why would grouping by category be useful to you? Perhaps some drink categories require more labor, time, or ingredients to make, so we need to find out and ensure they’re all profitable. 

Art from San-X

Since we’d like to sum the profits by category to see if the profits outweigh the input, using an aggregate function would be perfect in this case. An 

aggregate

function is often used in conjunction with 

group by

to perform a calculation on a column to create new data from existing features. While there are many kinds of aggregate functions (ie. count, average, minimum, max), today we’ll be discussing 

total sum

, which is summing values in the “Price” column based on groups in the “Category” column. 

Example code

In this section, we’ll implement aggregate by total sum using the example boba shop scenario mentioned above.

The general procedure implementing aggregate by total sum in code is:

  1. Group rows by unique value in the “Category” column using a dictionary

  2. For each row entry in each group, sum the prices to obtain profits per category

From scratch

Based on their menu, I made a list of drinks sold today at the shop (and this does not reflect 

actual sales since I’m not their manager 😳). To try this example, you can download data 

and follow along!

1
2
3
import pandas as pd
df = pd.read_csv('boba-history.csv')
df

Hypothetical sales for today

Our first order of business is to re-organize our data and group them by category. The best way to do this is to go through row by row (iterate) and save the drink name and price to a unique key in a dictionary. 

In this step, we are essentially creating boxes labeled by a key (ie. ‘Fresh Tea’, ‘Milk Foam Series’) and storing information about the drink (name and price) for every drink that falls under that category as a tuple. Here’s an example of a category “Smoothies” saved as a dictionary key, with the values as a list that contains drink name and price:

{‘Smoothies/Slushies’ : [ (‘Mango Smoothie’, $6.25) }

1
2
3
4
5
6
7
8
9
10
11
12
categories = {}

for _, row in df.iterrows():
  # create new key for each unique category we encounter
  key = row['Category']
  if key not in categories:
    categories[key] = []

  # add name + price to each group
  categories[key].append((row['Drink'], row['Price']))

categories

This dictionary/object format is loosely how Pandas’s built-in function groupby() saves groups, so remember it for the next section!

Now that we have sorted our boba drinks by category, we can finally sum the price by group. Thus, for every group (key), we’ll sum all of the prices (found in the second index of each drink) to get the total sales of each boba drink category.

1
2
3
4
5
6
price = {}
for key in categories:
   price[key] = 0
   # for each group, sum the price
   for drink in categories[key]:
      price[key] += drink[1] # the price

From our results, we can see that the “Milk Tea” category has earned us the most, which makes sense since we sold 7 cups of boba from this category. 

Using Pandas

Pandas supports 

and aggregate functions, so we can simplify the process we fleshed out in the previous section. We can use the Pandas function to compartmentalize our drinks into their respective categories. 

However, since the groupby function is used with an aggregate function (like finding the average, count, total sum, etc), the function returns a weird object that doesn’t quite know what to do with itself because it’s expecting you to call an aggregate function after.

In this next section, there’s no need to run this code as it’s just dismantling the groupby object to explicate that the drinks are indeed grouped by categories:

Notes about the code: 

  1. We can use 

    to view the unique categories that we’ve grouped all of our drinks into. In other words, these are all the unique values found in the “Category” column of our dataset, and all drinks are categorized under one of these groups.

  2. Since “key” is one of the group names, we use 

    get_group(key)

    to view the drinks that are categorized under this group. 

  3. We observe that there are seven orders under the category “Milk Tea,” which we will mention later

Now that we’ve confirmed that all the drinks are sorted into categories, we can now apply sum() to total the profits made in each category. This is the only code you have to run, convenient, right?

1
2
gp = df.groupby(['Category'])['Price'].sum()
gp

We can also observe that the total sums we obtained in this section matches those from the “From scratch” section earlier, which (probably) means it’s correct! 

The complete code to obtain the sum of price by category is:

Magical no-code solution ✨🔮

On days when you don’t feel like writing even one line of code, try our service, Mage! Our “Aggregate” function will make you a column that tells you the total sum of each group. Take a look at our demo below:

Want to learn more about machine learning (ML)? Visit 

! ✨🔮