Model Training - K fold cross validation

First published on February 14, 2022

Last updated on September 28, 2022

 

6 minute read

Felicia Kuan

Growth

TLDR

In this Mage Academy lesson on model training, we’ll learn how to split our data for training and testing our machine learning models with K Fold Cross Validation.

Glossary

  • Definition

  • Conceptual example

  • How to code

  • Magical no-code solution ✨🔮

Definition

K-fold cross-validation is a data partitioning technique that splits an entire dataset into k groups. We then train and test k different models, each on a different combination of those groups, and use the results from these k models to check the model’s overall performance and how well it generalizes.

In the context of machine learning, a fold is a set of rows in a dataset. The k in k-fold is the number of groups we decide to partition the data into. For example, a dataset of 20 rows can be split into 2 folds with 10 rows each, 4 folds with 5 rows each, or 10 folds with 2 rows each.
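To make the arithmetic concrete, here is a small sketch in plain Python that splits 20 row indices into k folds. The helper name `split_into_folds` is hypothetical, made up for this illustration, and it splits consecutively rather than randomly to keep things short.

```python
def split_into_folds(n_rows, k):
    """Split row indices 0..n_rows-1 into k consecutive folds of near-equal size."""
    base, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        # the first `extra` folds get one additional row when n_rows % k != 0
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


for k in (2, 4, 10):
    folds = split_into_folds(20, k)
    print(f"k={k}: {len(folds)} folds of {len(folds[0])} rows each")
```

With 20 rows, this reports 2 folds of 10 rows, 4 folds of 5 rows, and 10 folds of 2 rows, matching the example above.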

A simple explanation of how k-fold cross validation scores a model’s performance is:

  • The entire dataset is randomly split into k independent folds of equal (or nearly equal) size, with no row appearing in more than one fold.

  • We use k-1 folds for model training, and once that model is trained, we test it on the remaining 1 fold to obtain a score of the model’s performance.

  • We repeat this process k times, holding out a different fold each time, so we have k models and a score for each.

  • Lastly, we take the mean of the k scores to evaluate the model’s performance.
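The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the lesson's own pipeline: the data is synthetic (generated with `make_classification`), and the choices of `LogisticRegression`, k=5, and accuracy as the score are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data: 200 rows, 5 features (not the lesson's dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

k = 5
# Step 1: randomly split the rows into k non-overlapping folds
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):
    # Step 2: train a fresh model on the k-1 training folds...
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    # ...and score it (accuracy) on the single held-out fold
    scores.append(clf.score(X[test_idx], y[test_idx]))
    # Step 3: the loop repeats this k times, one held-out fold per pass

# Step 4: average the k scores
mean_score = np.mean(scores)
print("fold scores:", np.round(scores, 3))
print("mean score: %.3f" % mean_score)
```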

Conceptual example

To improve your understanding twice-fold 😏, consider this analogy about k-fold cross validation with Twice, a K-pop girl group. Say we are trying to see how well a model can dance by inviting different subsets of Twice girls (called folds) as training and test samples.

Source: Twice Official Twitter

If the entire dataset has 9 girls, which are our data points, then we need to manually choose how many folds to split our data into. I’m going with 3 for our example, but there are strategies to pick the best k.

Since we need an equal amount of data in each fold, we randomly pick 3 girls from Twice for each of the three folds, with no overlaps:

With these 3 folds, we will train and evaluate 3 models (because we picked k=3), training each one on 2 folds (k-1 folds) and using the remaining 1 as the test set. Each of the 3 models gets a different combination of folds.

Model 1: Trained on Fold 1 + Fold 2, Tested on Fold 3

Model 2: Trained on Fold 2 + Fold 3, Tested on Fold 1

Model 3: Trained on Fold 1 + Fold 3, Tested on Fold 2
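The same 3-fold split can be sketched with scikit-learn's `KFold`. This is only an illustration: the member names are filled in here for the example, and because the split is random, the folds and their order will not necessarily match the Model 1 to 3 assignment listed above.

```python
import numpy as np
from sklearn.model_selection import KFold

# The 9 data points in the analogy
members = np.array(["Nayeon", "Jeongyeon", "Momo", "Sana", "Jihyo",
                    "Mina", "Dahyun", "Chaeyoung", "Tzuyu"])

# k=3 folds, shuffled so each fold is a random trio
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

judged_by = []      # everyone who ends up in a test fold
test_sizes = []     # size of each test fold
for model_no, (train_idx, test_idx) in enumerate(kfold.split(members), start=1):
    teachers = ", ".join(members[train_idx])   # 6 members used for training
    judges = ", ".join(members[test_idx])      # 3 members held out for testing
    print("Model %d: trained on [%s], tested on [%s]" % (model_no, teachers, judges))
    judged_by.extend(str(m) for m in members[test_idx])
    test_sizes.append(len(test_idx))
```

Across the three models, every member lands in a test fold exactly once, and never in the training and test set of the same model.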

The performance scores would get skewed if the same Twice girls who taught you how to dance were also your judges. So whichever six girls (data points) the model learned from, the remaining three girls would judge and score you.

Now that you have 3 models and their scores, we can choose a model evaluation method (discussed in another lesson) to determine, in general, whether this model dances well. Combining the scores this way also ensures that one metric includes the opinions of all 9 judges/test samples.

The resulting evaluation metric would tell us whether we did a good job at dancing. So did we do a good job?

Nayeon is rooting for you!

How to code

Let’s try to evaluate how well a model learns to predict whether customers of a tourism company flake on their plans or not, using Tejashvi’s dataset. Maybe this model could tell us whether we’d follow through with our dreams of vacationing overseas this year, too?

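import pandas as pd

# load the customer travel dataset into a DataFrame
df = pd.read_csv("Customertravel.csv")

# in a notebook, this displays the table along with its row count
df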

Since scikit-learn works with array-like data, we first use pandas to convert our DataFrame into a NumPy array.

Then, we can use the “KFold” class to configure our evaluation. Our next step is to choose the number of folds to split our rows of data into. Above, we can see that our dataset has 954 rows, which divides evenly into 9 folds with 106 rows of data each.

This means we’d build and evaluate 9 models total, each using 8 folds for training and 1 for scoring.
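As a quick sanity check on that arithmetic, the sketch below runs `KFold` over a placeholder array with 954 rows and counts the rows on each side of every split. The placeholder array is an assumption made so the check runs without the CSV file; only its length matters.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 954                          # row count quoted in the lesson
placeholder = np.zeros((n_rows, 1))   # stand-in array; only its length matters

kfold = KFold(n_splits=9, shuffle=True, random_state=1)

# (training rows, test rows) for each of the 9 models
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kfold.split(placeholder)]

for model_no, (n_train, n_test) in enumerate(fold_sizes, start=1):
    print("Model #%d: %d training rows, %d test rows" % (model_no, n_train, n_test))
```

Each of the 9 models gets 848 training rows (8 folds of 106) and 106 test rows (1 fold).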

from sklearn.model_selection import KFold

# convert the DataFrame loaded above into a NumPy array
np_array = df.to_numpy()

# shuffle=True shuffles the rows before splitting them into folds;
# random_state=1 fixes the shuffle so the folds are reproducible
kfold = KFold(n_splits=9, shuffle=True, random_state=1)

# display the row indices used for training/testing each model
for model, (train, test) in enumerate(kfold.split(np_array), start=1):
    print('Model #%d:' % model)
    print('train: %s, test: %s' % (train, test))

Now that we’re done splitting our data into 9 folds, we’re ready to continue onto the next lesson of evaluating the model!
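As a preview of that step, here is one way a `KFold` object like the one above might be handed to scikit-learn's `cross_val_score`, which trains and scores one model per fold. This is a hedged sketch only: the data is synthetic (954 rows to mirror the row count here), and `DecisionTreeClassifier` is an arbitrary choice, not the model used in the next lesson.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the lesson's dataset
X, y = make_classification(n_samples=954, n_features=6, random_state=1)

kfold = KFold(n_splits=9, shuffle=True, random_state=1)
clf = DecisionTreeClassifier(random_state=1)

# one score per fold; with no `scoring` argument, a classifier is scored by accuracy
scores = cross_val_score(clf, X, y, cv=kfold)
print("scores per fold:", scores.round(3))
print("mean accuracy: %.3f" % scores.mean())
```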

Magical no-code solution ✨🔮

To skip all those configuration steps for K-fold cross validation, Mage provides an easy, no-code experience of training and testing a dataset. Although we, as users, aren’t able to customize how our data is split, Mage uses an algorithm to decide. For this dataset, Mage decided on approximately a 9:1 training to testing split.

You can find further details about the training/test split under “Review > Statistics” on our Mage web application.
