Terraforming a planet means running large-scale projects to make another world habitable enough for survival. We’ll begin by terraforming datasets to calculate the cost of survival on the Titanic.
Before we begin
Titanic meets Iceberg (Source: Britannica)
In “Product developers’ guide to getting started with AI — Part 3: Terraforming dataframes”, we’ll look at the price point of a “golden ticket” that gives the best chance of survival. Based on the calculated SHAP values, survival is strongly tied to sex, passenger class, fare, and age.
Mage Analyzer Page (Source: SHAP)
Manipulating a dataset is a quick and easy way to rearrange data and extract what you need. In this series we’ve gone over how to pick and search through data, so it’s time to look at transforming the underlying data.
It is highly advised to have read part 2 before continuing.
In this guide, we’ll be using the Titanic dataset along with Google Colab.
I’ll briefly reuse techniques from previous parts, such as surfing and extracting, to quickly start us off with an ideal dataframe for applying transformations and functions.
Part 2: Surfing through dataframes
Python supports a functional programming style, in which operations are expressed as functions. This matters because later in this guide we’ll be creating functions and passing lambda expressions to apply. If you’re already comfortable with Python, you may skip this section. Otherwise, keep reading for a quick refresher on the syntax for defining functions and lambda expressions.
In Python, a function is defined with the “def” keyword and can take any number of arguments.
Basic Adder that adds 1 to the value
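As a minimal sketch of such an adder (the names `adder` and `value` are my own choices for illustration), the function could look like this:

```python
def adder(value):
    """Return the given value plus 1."""
    return value + 1


print(adder(41))  # prints 42
```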
The adder function can be rewritten in shorthand as a lambda expression.
Lambda expression of the adder
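A sketch of the same adder as a lambda expression, again with illustrative names, might be:

```python
# A lambda is an anonymous function; binding it to a name lets us reuse it.
adder = lambda value: value + 1

print(adder(41))  # prints 42
```

In practice a lambda is usually passed directly as an argument rather than stored in a variable, which is how it appears in the dataframe examples.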
For a small operation, like the adder above, a lambda expression is a good fit. For more complex calculations that are used multiple times, define a function. When in doubt, check whether there is a simpler way and how much repetition will occur.
The simplest way to manipulate a dataframe is with apply. Apply takes in a function and runs it across either the columns or the rows of a dataframe. It is handy for quickly calculating or encrypting data.
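To make this concrete, here is a small sketch of apply on a single column, on every column, and on every row. The dataframe is made up for illustration and does not contain real Titanic records.

```python
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({"Fare": [10.0, 20.0, 30.0], "Age": [22.0, 38.0, 26.0]})

# Apply a lambda to every value in one column.
fare_plus_one = df["Fare"].apply(lambda fare: fare + 1)

# Apply a function to each column (axis=0 is the default).
column_max = df.apply(lambda column: column.max())

# Apply a function to each row by passing axis=1.
row_sum = df.apply(lambda row: row["Fare"] + row["Age"], axis=1)

print(fare_plus_one.tolist())  # [11.0, 21.0, 31.0]
print(column_max["Fare"])      # 30.0
print(row_sum.tolist())        # [32.0, 58.0, 56.0]
```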
Based on the SHAP values, we form a hypothesis that women and children are more likely to survive. One possible reason is that they could board the lifeboats first. Another is that the upper-class areas of the ship were less densely populated, allowing those passengers to escape more quickly than passengers in the lower classes.
Lifeboats on the Titanic (Source: DailyMail)
To find the average price point of the winning ticket (a ticket for a young lady in 1st class), we first need to filter down our rows and columns. In the dataframe, “Pclass” represents whether a passenger is located in the 1st class, 2nd class, or 3rd class area of the Titanic. The average is calculated as the sum of the prices divided by the total number, or count, of items, but may also be calculated by the mean.
We filter the rows on three conditions:
Sex is female
Passenger class is 1st class only
Age is no higher than 40 years old
Then reduce the result to show only the relevant information: ‘Fare’, the price of the golden ticket.
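The filtering steps can be sketched with a boolean mask. The rows below are made up for illustration, and I am treating the age cutoff of 40 as an upper bound to match the “young lady” description; that reading, along with the names `is_golden` and `golden_fares`, is an assumption.

```python
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "female", "female"],
    "Pclass": [1, 1, 1, 3, 1],
    "Age": [29.0, 35.0, 30.0, 22.0, 58.0],
    "Fare": [100.0, 80.0, 50.0, 8.0, 150.0],
})

# Combine the three row conditions with & (each wrapped in parentheses).
is_golden = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] <= 40)

# Keep only the matching rows, and only the 'Fare' column.
golden_fares = df.loc[is_golden, "Fare"]

print(golden_fares.tolist())  # [100.0, 80.0]
```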
Then, we take the sum of the ‘Fare’ column and divide by the total number of items.
The total price of all golden tickets is $6484.80
Average price of $113.77
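The calculation itself is short. Here is a sketch on a made-up series of fares (so the numbers will not match the real figures above):

```python
import pandas as pd

# Made-up fares for illustration; not the real golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 150.0], name="Fare")

total = golden_fares.sum()    # 330.0
count = golden_fares.count()  # 3 (number of non-missing values)
average = total / count       # 110.0

print(total, count, average)
```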
We can confirm this is the same when calculating the mean of the prices.
The mean matches the average price of $113.77
Pandas has multiple other built-in mathematical functions, such as median.
Median is $86.50
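Both built-ins are one call each. A sketch on the same kind of made-up fares, which also checks that the mean agrees with sum divided by count:

```python
import pandas as pd

# Made-up fares for illustration; not the real golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 150.0], name="Fare")

mean_fare = golden_fares.mean()      # 110.0
median_fare = golden_fares.median()  # 100.0

# The mean is the same as dividing the sum by the count.
manual_average = golden_fares.sum() / golden_fares.count()

print(mean_fare, median_fare, manual_average)
```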
Unfortunately, each of these must be calculated separately. That makes apply good for short functions, but what about longer ones? That’s where agg shines, by removing repetition.
If you know which aggregate you want to apply ahead of time, use agg instead. When calculating several values at once, such as the sum, mean, or standard deviation, agg is a neater way to calculate than using apply.
For instance, if we were to use agg instead, we could grab multiple values all at once. For our next section, we’ll need the standard deviation, so let’s calculate that as well. Note: The shorthand is agg, which is functionally equivalent to aggregate.
1 liner for sum, mean, max, and median
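A sketch of that one-liner on made-up fares follows. Passing a list of function names returns one labelled result per function, and `agg` and `aggregate` give the same answer.

```python
import pandas as pd

# Made-up fares for illustration; not the real golden ticket prices.
golden_fares = pd.Series([100.0, 80.0, 150.0], name="Fare")

# One call, several aggregations; the result is indexed by function name.
summary = golden_fares.agg(["sum", "mean", "max", "median", "std"])

# agg is an alias of aggregate, so this produces the same result.
same_summary = golden_fares.aggregate(["sum", "mean", "max", "median", "std"])

print(summary)
```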
Another way of manipulating a dataframe is by using transform. This is similar to apply, except that it applies the function to the dataframe itself, repeating it for all columns, and returns a result shaped like the original. Because the output lines up with the input, transform has broader uses: results can be passed straight back into the dataframe, so several operations can be chained together.
Because transform returns a result aligned with its input, the result must be the same length as the original input. This means that functions such as sum(), mean(), and max()/min() don’t work on their own, as they condense or aggregate all the data into 1 value.
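The length rule can be seen in a short sketch on made-up fares. Recent pandas versions are expected to raise a ValueError when an aggregate such as sum is passed to a plain transform; the exact behaviour may differ in older releases, so the example only records whether the call was rejected.

```python
import pandas as pd

# Made-up fares for illustration; not real Titanic records.
fares = pd.Series([100.0, 80.0, 150.0], name="Fare")

# An element-wise function keeps the original length, so transform accepts it.
doubled = fares.transform(lambda fare: fare * 2)

# An aggregate collapses the series to one value, which breaks the length rule.
try:
    fares.transform("sum")
    rejected = False
except ValueError:
    # Expected in recent pandas versions.
    rejected = True

print(len(doubled) == len(fares), rejected)
```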
Calculate individual percentages
Back to the original problem: find out what percentage of passengers have a “golden ticket”. Using transform, we can combine an aggregate with a series to calculate the individual values. This makes transform more useful for looking at the finer details.
Calculate individual percentages
Likewise, summing the individual results should give 1.0 (100%).
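One way to sketch this is to compute the total once and let transform divide every fare by it. The fares are made up, and the names `total` and `share` are illustrative.

```python
import pandas as pd

# Made-up fares for illustration; not real Titanic records.
fares = pd.Series([100.0, 80.0, 150.0, 50.0, 20.0], name="Fare")

# The aggregate: one value for the whole series.
total = fares.sum()  # 400.0

# The transform: one share per passenger, same length as the input.
share = fares.transform(lambda fare: fare / total)

print(share.tolist())  # [0.25, 0.2, 0.375, 0.125, 0.05]
print(share.sum())     # approximately 1.0
```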
To find out how many passengers paid top dollar, we first take the original dataset and calculate the percentages. We leverage transform’s ability to maintain length, along with groupby to group our data.
What slice of the “pie” do the golden ticket passengers make up?
23% of all income on the ship is from golden ticket sales.
What percentage of passengers own a golden ticket?
Only 6% of all passengers purchased a golden ticket.
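A sketch of how groupby and transform can work together on made-up rows is shown below. The `Golden`, `IncomeShare`, and `PassengerShare` columns are hypothetical names I introduce for illustration, and the age cutoff is again assumed to be an upper bound, so the results will not match the real 23% and 6%.

```python
import pandas as pd

# Made-up rows for illustration; not real Titanic records.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "female", "male"],
    "Pclass": [1, 1, 1, 3, 3],
    "Age": [29.0, 35.0, 30.0, 22.0, 40.0],
    "Fare": [100.0, 80.0, 50.0, 10.0, 10.0],
})

# Flag the golden ticket holders.
df["Golden"] = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] <= 40)

# groupby + transform broadcasts each group's aggregate back to every row,
# so the result keeps the original length.
grouped_fares = df.groupby("Golden")["Fare"]
df["IncomeShare"] = grouped_fares.transform("sum") / df["Fare"].sum()
df["PassengerShare"] = grouped_fares.transform("count") / len(df)

golden_rows = df[df["Golden"]]
golden_income_share = golden_rows["IncomeShare"].iloc[0]        # 180 / 250
golden_passenger_share = golden_rows["PassengerShare"].iloc[0]  # 2 / 5

print(golden_income_share, golden_passenger_share)
```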
Transform returns a result the same length as its input. Therefore, transform on its own can’t handle aggregate methods (sum, mean, standard deviation, etc.).
Apply is typically given one function at a time, while agg accepts several aggregations at once.
Transform is best used to create a new column in a table to see fine detail.
Aggregate and apply are useful for calculating a single summary value.
That’s it! Now you’re ready to tackle future problems in data science. Using your newfound knowledge, I suggest modifying the steps to calculate what percentage of golden ticket holders survived as your next step in familiarizing yourself with these core AI concepts. As always, stay tuned for future guides where we’ll go over more topics ranging from joining datasets to deploying a machine learning model to the Cloud.
I’ve got a Golden Ticket! (Source: South Park)