Feature Engineering - Split features with substrings

First published on February 10, 2022

Last updated at April 22, 2022

 

5 minute read

Felicia Kuan

Growth

TLDR

In this Mage Academy lesson on feature engineering, we’ll learn how we’d parse and extract contents of a column at a particular index for the purpose of simplifying or specifying the information in a separate column.

Outline

  • Examples of usage

  • Edge cases

  • Code

  • Magical no-code solution ✨🔮

Examples of usage

When leveraging machine learning models, it’s crucial that each column presents simple and clear information to speed up data visualization and maximize model accuracy. 

Depending on the type of model you are currently organizing data for, some columns contain data that have a fixed format. Making use of the formatting, you can use the index to extract a substring of information and save it into a new column for further analysis.

Here are some examples of when you’d utilize a fixed index and split features by substring:

  1. Save dollar value from a string “$

    2000

    ” in a new column by keeping everything but the first character

1
2
>>> "$2000"[1:]
'2000'

2. Get the last 3 characters in a file name to extract file extensions “presentation.

ppt

1
2
>>> "presentation.ppt"[-3:]
'ppt'

3. Extract the middle 2 characters from a date (2020-

01

-22) to get the month → substring [5:7]

1
2
>>> "2020-01-22"[5:7]
'01'

Edge cases

4. What happens when the start and end indices are the same?

1
2
>>> "mage academy"[2:2]
''

5. In Python, if the start index is larger than the end, we also capture nothing.

1
2
>>> "churn prediction"[3:1]
''

Code

A gentle reminder that substring indexing in Python starts reading from the whitespace before the first character.  

Imagine a ruler– if we just wanted to extract the “$” in the string above, you’d capture it using the logic substring [0:1], and if you wanted just the numerical values, you’d use the purple indices, substring [1: 5].

As for shorthand, substring [1: ] means from index 1 to end. Substring [ : ] means start to end, so you’d end up with the full string.

Since we will be extracting substrings from each row of a specific column, it will be useful to go over how to select data from a list of strings at a predetermined index. 

Below, I’ve created a sample list of one day’s sales at Sephora:

Observing the string list, since there’s always a dollar sign at the front and we want a numerical value, we decide to simply omit the first character of each entry and keep the rest.

Therefore, if the full string is “$200” and starts at index 0, we keep everything from index 1 and afterwards, which contains the “200.”

link of gist

Then, we get the result:

1
[200.0, 239.99, 110.0, 150.0, 300.0]

Also, the sales are in the hundreds because Sephora is pricey.

👁️👄👁️

As per the behavior of the split function, we have now created a new list “sales_num” that contains numerical prices. If this were a dataset, 

sales_num

would be a new column/feature. 

While analyzing the “day.csv” file in the rental bike sharing dataset by Yam Peleg on 

Box

, let’s say we just wanted the month from the column “dteday” that contains the date.

psst: “dteday” is third from the left 😉

Since the data in this column is all formatted the same, if we wanted just the month in a new column, we need to access the two digits denoting the month in the “YYYY-MM-DD” format, at indexes 5 and 6. Thus, we’d use the slice notation “substring[5:7],” since the second parameter in the bracket indicates where to stop. 

Then, I used the function 

reindex

to reorder the newly made “month” column to the right of “dteday” so that we can check whether the correct characters have been extracted. Also, please conveniently ignore that this dataset already contains a “mnth” column. 

Magical no-code solution ✨🔮

On those days when you’d rather experience our Mage UI than coding, you can turn to the “Add column” using “substrings” column transformation feature on 

Mage

. The following are some examples of how you can split a column by substring.

Using the same rental bike dataset and column as the above example, we can extract the year from the column in the format “YYYY-MM-DD” by indexing from the start to the last digit of the year.

We can extract a substring for the month like explained in the above code using Mage as well!

Since the type of the “dteday” column is date/time, if we just wanted the time for each row, we'd want to extract the last 15 characters.

Want to learn more about machine learning (ML)? Visit 

Mage Academy

! ✨🔮