How to use Spark and Pandas to prepare big data

May 12, 2021 · 14 minute read

Tommy Dang

Engineering

TLDR

If you want to train machine learning models, you may need to prepare your data ahead of time. Data preparation can include cleaning your data, adding new columns, removing columns, combining columns, grouping rows, sorting rows, etc.

Once you write your data preparation code, there are a few ways to execute it:

  1. Download the data onto your local computer and run a script to transform it

  2. Download the data onto a server, upload a script, and run the script on the remote server

  3. Run some complex transformation on the data from a data warehouse using SQL-like language

  4. Use a Spark job with some logic from your script to transform the data

We’ll be sharing how Mage uses option 4 to prepare data for machine learning models.

Prerequisites

Apache Spark is one of the most actively developed open-source projects in big data. The following code examples require that you have Spark set up and can execute Python code using the PySpark library. The examples also require that you have your data in Amazon S3 (Simple Storage Service). All this is set up on AWS EMR (Elastic MapReduce).

We’ve learned a lot while setting up Spark on AWS EMR. While this post will focus on how to use PySpark with Pandas, let us know in the comments if you’re interested in a future article on how we set up Spark on AWS EMR.

Outline

  1. How to use PySpark to load data from Amazon S3

  2. Write Python code to transform data

How to use PySpark to load data from Amazon S3

PySpark is “an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.”

We store feature sets and training sets (data used to store features for machine learning models) as CSV files in Amazon S3.

Here are the high-level steps in the code:

  1. Load data from S3 files; we will use CSV (comma separated values) file format in this example.

  2. Group the data together by some column(s).

  3. Apply a Python function to each group; we will define this function in the next section.

from pyspark.sql import SparkSession


def load_data(spark, s3_location):
    """
    spark:
        Spark session
    s3_location:
        S3 bucket name and object prefix
    """
    return (
        spark
        .read
        .options(
            delimiter=',',
            header=True,
            inferSchema=False,
        )
        .csv(s3_location)
    )

with SparkSession.builder.appName('Mage').getOrCreate() as spark:
    # 1. Load data from S3 files
    df = load_data(spark, 's3://feature-sets/users/profiles/v1/*')
    
    # 2. Group data by 'user_id' column
    grouped = df.groupby('user_id')

    # 3. Apply function named 'custom_transformation_function';
    # we will define this function later in this article
    df_transformed = grouped.apply(custom_transformation_function)
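
With header=True and inferSchema=False, Spark takes the column names from the first line of each file and leaves every column as a string. Python's standard csv module behaves in a similar way, so the sketch below uses it as a local stand-in to show what that means for later steps. The file contents and column names are hypothetical, and this is only an illustration, not how Spark reads the data internally.

```python
import csv
import io

# Hypothetical file contents standing in for one CSV object in S3.
raw = "user_id,date\n1,2021-05-01\n2,2021-05-02\n"

# The first line supplies the column names, like header=True.
rows = list(csv.DictReader(io.StringIO(raw)))

# Every value comes back as a string, like Spark columns
# loaded with inferSchema=False.
print(rows[0])

# Cast explicitly wherever a numeric type is needed downstream.
user_ids = [int(row['user_id']) for row in rows]
print(user_ids)
```

Keeping this in mind helps when the transformation has to return typed columns, such as an integer user_id.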

Write Python code to transform data

Here are the high-level steps in the code:

  1. Define Pandas UDF (user defined function)

  2. Define schema

  3. Write code logic to be run on grouped data

Define Pandas UDF (user defined function)

Pandas is “a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language”.

A Pandas user-defined function (UDF) is built on top of Apache Arrow. Pandas UDFs let developers scale their workloads and use the Pandas API inside Apache Spark: the code inside the function works with Pandas objects, while Apache Arrow handles the exchange of data between Spark and Python.

from pyspark.sql.functions import pandas_udf, PandasUDFType


@pandas_udf(
    SCHEMA_COMING_SOON,
    PandasUDFType.GROUPED_MAP,
)
def custom_transformation_function(df):
    pass
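
On Spark 3.0 and later, the same grouped-map pattern can also be written with applyInPandas, which the Spark documentation prefers over the GROUPED_MAP decorator style. The sketch below is one possible shape under that assumption; keep_first_row and apply_per_user are hypothetical names used only for illustration, and the Spark call is wrapped in a function so the Pandas part can be tried without a cluster.

```python
import pandas as pd


def keep_first_row(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-group function: keep one row per group."""
    return pdf.head(1)


def apply_per_user(df):
    """
    df: a Spark DataFrame with a 'user_id' column (assumes Spark 3.0+).
    The output schema is passed directly instead of through a decorator;
    here the function returns the same columns it receives, so the
    input schema is reused.
    """
    return df.groupby('user_id').applyInPandas(keep_first_row, schema=df.schema)


# The per-group function is plain Pandas, so it can be tried without Spark.
group = pd.DataFrame({
    'user_id': ['1', '1'],
    'date': ['2021-05-01', '2021-05-02'],
})
first = keep_first_row(group)
print(first)
```

Because the function is undecorated, it stays an ordinary Python function that can be unit tested on its own.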

Define schema

Using Pandas UDF requires that we define the schema of the data structure that the custom function returns.

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructField,
    StructType,
)


"""
StructField arguments:
    First argument: column name
    Second argument: column type
    Third argument: True if this column can have null values
"""
SCHEMA_COMING_SOON = StructType([
    StructField('user_id', IntegerType(), True),
    StructField('date', StringType(), True),
    StructField('number_of_rows', IntegerType(), True),
])


@pandas_udf(
    SCHEMA_COMING_SOON,
    PandasUDFType.GROUPED_MAP,
)
def custom_transformation_function(df):
    pass
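
A mismatch between the declared schema and the columns the function returns usually only surfaces once the job runs on the cluster. A small check like the one below can catch it locally first. This is a sketch: column_mismatch is a hypothetical helper, not part of PySpark, and the sample DataFrame is made up.

```python
import pandas as pd


def column_mismatch(pdf, expected_columns):
    """Return (missing, unexpected) column names relative to the schema."""
    actual = list(pdf.columns)
    missing = [c for c in expected_columns if c not in actual]
    unexpected = [c for c in actual if c not in expected_columns]
    return missing, unexpected


# Hypothetical output of a per-group function with a misnamed column.
pdf = pd.DataFrame({'user_id': [1], 'date': ['2021-05-01'], 'total': [3]})

missing, unexpected = column_mismatch(
    pdf,
    ['user_id', 'date', 'number_of_rows'],
)
print(missing, unexpected)
```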

Write code logic to be run on grouped data

Once your data has been grouped, your custom code logic can be executed on each group in parallel. Notice how the function named custom_transformation_function returns a Pandas DataFrame with 3 columns: user_id, date, and number_of_rows. These 3 columns have their column types explicitly defined in the schema when decorating the function with the @pandas_udf decorator.

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructField,
    StructType,
)


"""
StructField arguments:
    First argument: column name
    Second argument: column type
    Third argument: True if this column can have null values
"""
SCHEMA_COMING_SOON = StructType([
    StructField('user_id', IntegerType(), True),
    StructField('date', StringType(), True),
    StructField('number_of_rows', IntegerType(), True),
])


@pandas_udf(
    SCHEMA_COMING_SOON,
    PandasUDFType.GROUPED_MAP,
)
def custom_transformation_function(df):
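    # Count the rows for each date within this user's group
    number_of_rows_by_date = (
        df.groupby('date').size().reset_index(name='number_of_rows')
    )
    # Every row in the group shares one user_id; cast it to match IntegerType
    number_of_rows_by_date['user_id'] = int(df['user_id'].iloc[0])

    # Return the columns in the same order as the schema
    return number_of_rows_by_date[['user_id', 'date', 'number_of_rows']]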
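
Before running on a cluster, the per-group logic can be tried locally on a small Pandas DataFrame. The sketch below uses an undecorated function, count_rows_by_date (a hypothetical name), that counts rows per date for one user, and then emulates Spark's group-then-apply step with a plain loop. The sample rows are made up, and the values are strings to mirror data loaded with inferSchema=False.

```python
import pandas as pd


def count_rows_by_date(df: pd.DataFrame) -> pd.DataFrame:
    # Count the rows for each date within one user's group.
    counts = df.groupby('date').size().reset_index(name='number_of_rows')
    # Every row in a group shares the same user_id; cast it to an integer.
    counts['user_id'] = int(df['user_id'].iloc[0])
    return counts[['user_id', 'date', 'number_of_rows']]


# Hypothetical sample rows for two users.
sample = pd.DataFrame({
    'user_id': ['1', '1', '1', '2'],
    'date': ['2021-05-01', '2021-05-01', '2021-05-02', '2021-05-01'],
})

# Emulate group-then-apply locally, one group at a time.
result = pd.concat(
    [count_rows_by_date(group) for _, group in sample.groupby('user_id')],
    ignore_index=True,
)
print(result)
```

On the cluster, Spark runs the function once per user_id group and combines the returned DataFrames, which is what the loop and pd.concat stand in for here.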

Putting it all together

The last piece of code we add will save the transformed data to S3 in CSV format. Spark typically writes one file per partition under the output location.

(
    df_transformed.write
    .option('delimiter', ',')
    .option('header', 'True')
    .mode('overwrite')
    .csv('s3://feature-sets/users/profiles/transformed/v1/')
)

Here is the final code snippet that combines all the steps together:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructField,
    StructType,
)


"""
StructField arguments:
    First argument: column name
    Second argument: column type
    Third argument: True if this column can have null values
"""
SCHEMA_COMING_SOON = StructType([
    StructField('user_id', IntegerType(), True),
    StructField('date', StringType(), True),
    StructField('number_of_rows', IntegerType(), True),
])


@pandas_udf(
    SCHEMA_COMING_SOON,
    PandasUDFType.GROUPED_MAP,
)
def custom_transformation_function(df):
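    # Count the rows for each date within this user's group
    number_of_rows_by_date = (
        df.groupby('date').size().reset_index(name='number_of_rows')
    )
    # Every row in the group shares one user_id; cast it to match IntegerType
    number_of_rows_by_date['user_id'] = int(df['user_id'].iloc[0])

    # Return the columns in the same order as the schema
    return number_of_rows_by_date[['user_id', 'date', 'number_of_rows']]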


def load_data(spark, s3_location):
    """
    spark:
        Spark session
    s3_location:
        S3 bucket name and object prefix
    """
    
    return (
        spark
        .read
        .options(
            delimiter=',',
            header=True,
            inferSchema=False,
        )
        .csv(s3_location)
    )


with SparkSession.builder.appName('Mage').getOrCreate() as spark:
    # 1. Load data from S3 files
    df = load_data(spark, 's3://feature-sets/users/profiles/v1/*')
    
    # 2. Group data by 'user_id' column
    grouped = df.groupby('user_id')
    
    # 3. Apply function named 'custom_transformation_function',
    # which is defined above
    df_transformed = grouped.apply(custom_transformation_function)

    # 4. Save new transformed data to S3
    (
        df_transformed.write
        .option('delimiter', ',')
        .option('header', 'True')
        .mode('overwrite')
        .csv('s3://feature-sets/users/profiles/transformed/v1/')
    )

Conclusion

This is how we run complex transformations on large amounts of data at Mage using Python and the Pandas library. The benefit of this approach is that we can take advantage of Spark’s ability to query large amounts of data quickly while using Python and Pandas to perform complex data transformations through functional programming.

You can use Mage to handle complex data transformations with very little coding. We run your logic on our infrastructure so you don’t have to worry about setting up Apache Spark, writing complex Python code, understanding the Pandas API, setting up data lakes, etc. You can spend your time focusing on building models in Mage and applying them in your product, maximizing growth and revenue for your business.
