In this Mage Academy lesson on feature engineering, we’ll learn how shifting is used when working with time to calculate the difference between consecutive rows.
Before we begin
Time-series data is a sequence of data points in chronological order used by businesses to analyze past data and make future predictions. Suppose you’re working with time-series data such as weather forecasting, the number of taxi rides per month or predicting a company’s stock prices for the next day. While working on this type of data you may come across a situation to use all the previous rows of a dataframe or calculate the difference between the consecutive rows of it. Other data analysis applications like “Excel” may seem like a good idea to compare rows with each record, but in practice the process is slow.
Before we begin:
The “shift” function is used to shift the rows and columns of a dataframe based on the parameters mentioned in the “shift” argument. The “shift” function has the following arguments:
It is an integer value to specify the number of periods to shift data. It can be both positive and negative.
This can be used with “date offset”, “tseries.offset”, “timedelta” or “str”. If the index is not set to time series then an error will be raised.
It is used to specify the direction in which you want to shift. It is 0 if you want to shift the index and 1 if you want to shift the column.
It is optional. It can be used to fill the newly introduced missing values with a value that is passed in.
In this guide, we’ll show all the above-mentioned arguments with the help of the “shift” function looking at the stock market. We analyzed 2 weeks of data to understand the pattern of stock fluctuations and are looking for the impact of the COVID-19 pandemic and the oil market crash of 2020 which has led to dramatic change in the relationships driving stock market returns.
Trades.csv- This file contains the stock prices for the dates. We are analyzing 2 weeks of data.
Like all other guides, we’ll be using Google Collab and start by importing those datasets into dataframes.
Previously we have covered the “range” function for scaling email marketing datasets.
Let’s understand the implementation of the “shift” function.
1 2 3import pandas as pd df=pd.read_cvs("Trades.cvs", parse_dates=['Dates'],index_col='Dates') df
We’ll “shift” the prices by one day as the market closing price for the first day becomes the market opening price for the next day.
1 2#shift prices by one day df.shift(periods=1)
Here the price of the stock for the first date is shifted by one row as in the “periods” argument we mentioned the value as “1”. Since there was no prior data available for the first date we get a “NaN” (Not A Number) when shifting with a positive step.
From the above fig. We represent the working of the “shift” function by specifying step size as “1”.
To get the price of the stock from the next date we mention a negative step size in the “shift” function of the “periods” argument.
1 2#shifting the values up df.shift(periods=-2)
The price of the stock for the last 2 dates is shifted up by 2 rows, as in the “periods” argument we mentioned the value as “-2”. The “shift” function moves the values up and down depending on the step size mentioned as positive or negative.
Lets store the price of the stocks for the previous day by shifting and storing it in a dataframe called “Prev_Price”.
1 2 3# previous day price of stocks df['Prev_Price'] = df['Price'].shift(1) df
We can use the stock prices of the previous day to determine the change in the price of the stock and analyze whether the company is making a profit or loss.
1 2 3#change in price for one day df['1day_Change'] = df['Price'] - df['Prev_Price'] df
Looking at the change column, we gain an insight about the stock prices profit or loss on the specific dates over the course of 2 weeks. The stocks had made a profit only on 2 dates, “2021-05-24” and “2021-05-27”, and lost much more than it gained.
To find the percentage of profit or loss in between business days, let’s look at the calculation below for 5 consecutive working days.
1 2df['Percent_return'] = (df['Price'] - df['Price'].shift(5))*100/df['Price'].shift(5) df
In the above calculation, we use shift to the difference between 2 prices of 2 different dates with a 5 day gap and divide it by the initial “price”. This result is stored in the “Percent_return” feature column after multiplying it by 100 to find the profit or loss percentage.
We can see that there are no prices available for the first 5 dates, from “2021-05-17” to “2021-05-23” after performing the shift(5) operation. Since there is no percentage of return available before 5 dates so it is mentioned, “NaN”. The percentage of return is calculated for each stock from the 5th day.
We found that the company had made a profit in only 1 day i.e. “2021-05-30” and there was no profit and no loss on “2021-05-27”.
Next, let’s take a look at the frequency of the trading done by looking at the price column.
1 2df = df[['Price]] df
From above, we understand that the dates “2021-05-21”, “2021-05-22”, “2021-05-28” and “2021-05-29” are missing. As they are not business dates and the market is closed on weekends.
By checking the index of the dataframe we see that the “freq” is “None”. So we set the index to time-series frequency. We’ll use the “range” function of “Pandas” to specify the “start date” and set the frequency to “12 Hours”.
1 2df.index = pd.date_range(start='2021-05-17',periods=11, freq='12H') df
To understand how the stock prices have changed during the day, we have to calculate the “Intraday' price” of the stock for the same date within the “12 hours” gap. “Intraday” trading is a form of trading in which the trader buys and sells the stocks within the same day.
1 2df['Intraday_Price'] = df['Price'].shift(1) df
Here we’ll shift the column “price” in the column “Intraday_price” to understand the change in the price of stocks during the day within a “12 hours” gap. Next we set the “axis” argument as “1” to shift the column, while 0 would shift the row. We’ll fill the values of the “price” column with “0”.
1df.shift(1, axis=1, fill_value=0)
Let’s perform a column level shift on “price” to calculate the “Intraday” stock prices for the same day using “tshift”. For this, we’ll specify the step size to shift the values, using the index frequency specified in the dataframe (column 1).
We deduce that the “date” column has been shifted by one row down since we specified the positive step size value in the “tshift” function. If we want to shift the “date” column one row up we have to specify a negative step size value in the “tshift” function.
Now, you’ve learned about how to use the “Pandas” “shift” method to shift rows in a dataframe up and down. For dealing with “time-series” data by transforming 2 weeks of stock market trader data. Similarly, you can do more with “shift” to set ranges that will analyze daily, monthly and yearly time-series data.
Mage also supports shifting data and we can use that data to build a machine learning model.
With Mage, we can shift the rows down with our intuitive no code UI. Simply select the column you want, and shift it by some number in a specified direction. You can also apply grouping to shift multiples at once if you were looking at multiple categories, such as stocks among different companies.
Below is a gif of the whole process for transforming your data using the “shift” function with Mage.