Guide to model training: Part 4 - Ditching datetime

First published on December 2, 2021

 

4 minute read

Nathaniel Tjandra

Growth

TLDR

Apply feature engineering by converting dates into numerical values that machine learning models can train on.

Outline

  • Recap

  • Before we begin

  • Importance of dates

  • The datetime data types

  • Converting dates

  • What’s next?

Recap

In our series so far, we've gone over scaling data to prepare for model training. We started with a dataset filled with categorical and numerical values and scaled them so that a computer could understand them. Most of our dataset is now ready for model training; we just need to convert our dates.

Before we begin

In this section, we’ll be revisiting the data types of numerical and categorical values. Please read part 1 and part 2 before proceeding if you’re unfamiliar with those terms. We’ll be using the same big_data dataset used throughout the model training guides.

Importance of dates

When collecting data to feed into machine learning models, it's common to record when a user signed up. The model can use this information to find hidden correlations between users. Maybe there was a sign-up bonus or an event for users who created an account around a certain time. The sign-up date would capture whether that promotion succeeded or failed, so it's worth keeping in mind when reviewing the model.

Dates matter even more when collaborating across different locations or countries. They can be written in many ways and across multiple time zones, so the International Organization for Standardization published a common standard, ISO 8601, last updated in 2019. It writes dates and times with numerical values ordered from the largest unit to the smallest, such as 2021-11-30 for a date, which gives us a consistent starting point for formatting.
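As a small illustration of the ISO 8601 layout, here is a sketch that uses only Python's standard library. The sample date and time are made up for the example and do not come from our dataset.

```python
from datetime import date, datetime, timezone

# Parse an ISO 8601 calendar date (YYYY-MM-DD) into a date object.
signup = date.fromisoformat("2021-11-30")

# Writing the date back out in ISO 8601 form gives the same string.
iso_text = signup.isoformat()

# A full timestamp adds the time and a UTC offset after a "T" separator.
stamp = datetime(2021, 11, 30, 14, 45, tzinfo=timezone.utc).isoformat()

print(iso_text)  # 2021-11-30
print(stamp)     # 2021-11-30T14:45:00+00:00
```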

The datetime data types

Our dates are formatted like 2021-11-30, which follows a year, month, day format. But when you think about what data type that is, it's hard to say for sure. A computer reads it as an object or string at first. A human looking at it sees a set of numbers. So what is the actual data type?
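One way to see how the computer views these values is to check the column's dtype. This is a minimal sketch: the two-row table and the signup_date column name are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Hypothetical sample: dates stored as plain text, as they would be after reading a CSV.
df = pd.DataFrame({"signup_date": ["2021-11-30", "2021-12-02"]})

# Pandas reports a generic object dtype (or a string dtype in newer versions),
# not a datetime dtype.
print(df["signup_date"].dtype)

# Each individual value is still an ordinary Python string.
first_value = df["signup_date"].iloc[0]
is_datetime = pd.api.types.is_datetime64_any_dtype(df["signup_date"])
print(type(first_value).__name__, is_datetime)  # str False
```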

In Pandas, there is a to_datetime function that converts values to the datetime data type. It accepts a format string that specifies how to parse the input by year, month, day, day of week, month name, hour, minute, second, and can even account for 12-hour time or time zones. If no format is given, Pandas tries to infer the layout on its own. The format string uses the strftime directives from Python's standard library, which originate from the C library on UNIX systems.

Datetime abbreviations and outputs cheat sheet (Source: DevHints)
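To make the format directives concrete, here is a short sketch of to_datetime with two different format strings. The sample strings are invented for illustration; only to_datetime and its format parameter come from Pandas itself.

```python
import pandas as pd

# Year-month-day text, parsed with %Y (4-digit year), %m (month), %d (day).
raw = pd.Series(["2021-11-30", "2021-12-02"])
dates = pd.to_datetime(raw, format="%Y-%m-%d")

# US-style text with 12-hour time: %I is the 12-hour clock hour, %p is AM/PM.
raw_12h = pd.Series(["11/30/2021 02:45 PM"])
stamps = pd.to_datetime(raw_12h, format="%m/%d/%Y %I:%M %p")

print(dates.iloc[0])   # 2021-11-30 00:00:00
print(stamps.iloc[0])  # 2021-11-30 14:45:00
```

Spelling out the format avoids ambiguity for strings like 03-04-2021, which could be read as either March 4 or April 3.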

Converting dates

In our current dataset we have one datetime value, Dt_Customer, logged when a user first signs up for an account. Upon inspection, it’s a string or object data type.

Looking at the output, we see 21-08-2021, which shows that it is in day, month, year format. Comparing with the cheat sheet, we’ll match it with the format %d-%m-%Y.

The output standard is YYYY-MM-DD
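The conversion might look like the sketch below. The three sample rows are hypothetical, since the real big_data file isn't reproduced here; the column name Dt_Customer and the format %d-%m-%Y are the ones discussed above.

```python
import pandas as pd

# Hypothetical sample rows in the same day-month-year layout as the dataset.
df = pd.DataFrame({"Dt_Customer": ["21-08-2021", "04-09-2012", "13-01-2014"]})

# %d-%m-%Y tells Pandas to read each string as day-month-year.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# The parsed values now display in year-month-day order.
print(df["Dt_Customer"].iloc[0].date())  # 2021-08-21
```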

But we aren't done yet. Even though the column is now in datetime format, a model still can't train on it directly. To finish the conversion, we'll break the datetime down into separate columns for year, month, and day.

Once a column has the datetime data type, Pandas provides accessors that pull out specific portions of each date. We’ll be using the dt.year, dt.month, and dt.day attributes.
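Here is a minimal sketch of splitting a datetime column into numeric parts. The sample dates and the new column names Year, Month, and Day are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample: a column that already has the datetime data type.
df = pd.DataFrame({"Dt_Customer": pd.to_datetime(["2021-08-21", "2012-09-04"])})

# Each .dt attribute returns a column of plain integers.
df["Year"] = df["Dt_Customer"].dt.year
df["Month"] = df["Dt_Customer"].dt.month
df["Day"] = df["Dt_Customer"].dt.day

print(df[["Year", "Month", "Day"]].values.tolist())  # [[2021, 8, 21], [2012, 9, 4]]
```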

Once we are sure that the values match, let’s remove the original column so the dataset contains only machine-readable values.
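One possible way to check the values before dropping the column is to rebuild the dates from the numeric parts and compare them with the original. This is only a sketch on hypothetical sample rows, and the Year, Month, and Day column names are assumptions.

```python
import pandas as pd

# Hypothetical sample with the datetime column and its extracted parts.
df = pd.DataFrame({"Dt_Customer": pd.to_datetime(["2021-08-21", "2012-09-04"])})
df["Year"] = df["Dt_Customer"].dt.year
df["Month"] = df["Dt_Customer"].dt.month
df["Day"] = df["Dt_Customer"].dt.day

# to_datetime can assemble dates from columns named year, month, and day.
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
rebuilt = pd.to_datetime(parts)
matches = bool((rebuilt == df["Dt_Customer"]).all())

# Drop the original column only if every rebuilt date matches.
if matches:
    df = df.drop(columns=["Dt_Customer"])

print(list(df.columns))  # ['Year', 'Month', 'Day']
```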

What’s next

Now all of our data has been simplified to the point that a computer can understand it and generate models from it. Throughout the series we've covered scaling data, filling in missing values, and now converting dates into numbers. For our finale, we'll take all of our finished datasets from parts 1 through 4 and combine them to begin training a classification model for remarketing that predicts whether or not we should send another email to our customers.

Start building for free

No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.


© 2022 Mage Technologies, Inc.