How to reshape huge transactional datasets for rule-based machine learning models like Association Rule Mining with the Apriori algorithm

Source : https://www.upgrad.com/

If you have heard of Association Rule Mining, you have probably come across the Apriori algorithm, which generates rules based on three metrics: Support, Confidence, and Lift.

We are not going deep into what association rule mining is, but to explain it briefly: the algorithm uses the frequency of products purchased together in transactions and, based on that purchase history, tries to identify associations between the products. For example, if Product A and Product B are purchased together, there may be a high likelihood that the customer also purchases Product C.
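For reference, the three metrics are defined as: Support(A) = (number of transactions containing A) / (total transactions); Confidence(A → B) = Support(A and B) / Support(A); and Lift(A → B) = Confidence(A → B) / Support(B). A lift above 1 suggests A and B appear together more often than chance would predict.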

Note : Please go through an introduction to Association Rule Mining for further details.

To build these rules, we use the Apriori algorithm. The algorithm expects the data in a basket format, as shown below:

Transaction 1 — [Product A], [Product B], [Product D]

Transaction 2 — [Product B], [Product C]

Or it can be represented this way as well:

Fig — Desired : one row per invoice, one column per product, with cell values indicating the purchase

Here comes the problem: when a beginner starts reading about Association Rule Mining with the Apriori algorithm, the publicly available example datasets look as follows:

Fig1

We can see in Fig1 that each row corresponds to a transaction made by a single customer, together with the products billed in that transaction.

So the data is already available in the basket format, and the Apriori algorithm works directly on input of this shape.

But in practice we generally record data at the individual product level, one row per product per transaction, which looks like:

Fig2

So to convert the Data Frame from the shape in Fig2 to the shape in Fig1, we reshape it by turning rows into columns.

How is it done?

By using the unstack() method available in Pandas (see the pandas documentation for the full details of unstack()).

What is Unstacking?

Unstacking is a widely used piece of functionality provided by the Pandas library. It is a method that pivots a level of the (hierarchical) index labels: given a Data Frame with multi-level index labels, it returns a Data Frame with a new level of column labels whose innermost level consists of the pivoted index labels. (Source: the pandas documentation)

We will discuss this with a fuller example later, but here is a minimal taste first.
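The sketch below uses a toy two-level Series (the data is made up purely for illustration):

import pandas as pd

s = pd.Series(
    [10, 20, 30, 40],
    index=pd.MultiIndex.from_tuples(
        [('Jan', 'UK'), ('Jan', 'USA'), ('Feb', 'UK'), ('Feb', 'USA')],
        names=['Month', 'Country']))

# unstack() pivots the innermost index level ('Country') into column labels
print(s.unstack())
# Country  UK  USA
# Month
# Feb      30   40
# Jan      10   20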

But why would we use a method like unstack?

Generally, before any machine learning approach can be applied, the data needs to be analyzed and interpreted to extract meaningful results. The data we receive for a business problem is rarely in an easily understandable shape. For example, data for a retail-store problem is recorded at the transaction level, and we need a way to convert that transactional data into a more digestible form.

Unstack is one method that achieves this easily. It can be used whenever we want to reshape a Data Frame into a specific layout, either to analyze it or to understand a particular data point better.

How is it different from a pivot table?

Pivot tables internally use this unstack method to perform the pivot.

Let's see this with an example. I am using the Online Retail dataset from the UCI Machine Learning Repository; the download URL is in the code below.

import pandas as pd 
import numpy as np
from pandas import DataFrame

Read the Dataset:

df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()

We can see that this dataset consists of transaction-level data, one row per product per invoice, and we will need to aggregate it to the invoice level since our objective is to build baskets for association rules.

Now looking at the column data types :

df.info()

This is how the data types look.

Now a little preprocessing: we strip the time component from the InvoiceDate column:

df['InvoiceDate'] = df.InvoiceDate.dt.date

We create a column 'TotalSales', which is Quantity multiplied by UnitPrice:

df['TotalSales'] = df['Quantity'] * df['UnitPrice']

Now we extract the month into a separate column:

df['Month_Id'] = pd.DatetimeIndex(df.InvoiceDate).month
df.head()

What if we want to look at the monthly total sales for each country?

We can simply do a pivot, right?

df.pivot_table(index='Month_Id', columns='Country', values=['TotalSales'], aggfunc=['sum'])
Fig3

We can achieve the same using an unstack method.

Before actually doing that, let's restate the purpose of unstack: we use it when the innermost (lowest) index level should be converted into column labels.

Steps involved to achieve the pivot result:

1. Aggregate the data at the Month, Country level

2. Have Country as the innermost level of the index

3. Apply unstack on this

Step 1 & 2 :

df[['Month_Id','TotalSales','Country']].groupby(['Month_Id','Country'])['TotalSales'].sum()

We can see that the data has been grouped by Month and Country and aggregated, and that the innermost index level is Country.

Step 3:

df[['Month_Id','TotalSales','Country']].groupby(['Month_Id','Country'])['TotalSales'].sum().unstack()
Fig4

Looking at the two result sets, Fig3 and Fig4, we can see that we get the same results with both methods.

The only difference is that pivot_table can reshape duplicate values in the Data Frame by applying an aggregation function, while unstack just converts the innermost index level into column labels.
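To make that difference concrete, here is a small sketch with dummy data: pivot_table quietly aggregates a repeated (Month, Country) pair, while unstack requires the index pairs to be unique, so we aggregate with groupby first.

dup = pd.DataFrame({'Month': [1, 1], 'Country': ['UK', 'UK'], 'Sales': [5, 7]})

# pivot_table collapses the duplicated (1, 'UK') pair using aggfunc
dup.pivot_table(index='Month', columns='Country', values='Sales', aggfunc='sum')  # UK = 12

# unstacking duplicates directly raises "Index contains duplicate entries",
# so we aggregate first and then unstack
dup.groupby(['Month', 'Country'])['Sales'].sum().unstack()  # UK = 12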

Now, what about the pivot method in pandas?

Pivot serves the same reshaping purpose as unstack; when we need to aggregate duplicate data points, we use pivot_table instead.

Unstack is similar to the pivot functionality in pandas, but pivot requires the index, columns, and values to be specified explicitly; unstack figures that out for us by taking the innermost index level of the Data Frame.

Unstack also works directly on Series data, which is exactly what a groupby aggregation returns; pivot, by contrast, needs those values back in ordinary columns first.
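As a sketch, here is the same monthly-sales reshape done with pivot, assuming the df and Month_Id column built earlier; note that we flatten the groupby result back into ordinary columns before pivot can see them:

monthly = df.groupby(['Month_Id', 'Country'], as_index=False)['TotalSales'].sum()

# pivot needs index, columns and values named explicitly
monthly.pivot(index='Month_Id', columns='Country', values='TotalSales')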

We can pivot or unstack multiple values as well.

Here is an example of unstacking two values, TotalSales and Quantity:

df[['Month_Id','TotalSales','Country','Quantity']].groupby(['Month_Id','Country'])[['TotalSales','Quantity']].sum().unstack()

We can see that a two-level column index is created, one level for the two value names and one for the countries (highlighted in the output).
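A common follow-up, sketched below, is selecting one block of the two-level columns or flattening the levels into single strings (the result variable is just the output of the statement above):

result = (df[['Month_Id', 'TotalSales', 'Country', 'Quantity']]
          .groupby(['Month_Id', 'Country'])[['TotalSales', 'Quantity']]
          .sum()
          .unstack())

result['TotalSales']  # just the sales block, one column per country
result.columns = ['_'.join(map(str, col)) for col in result.columns]  # e.g. 'TotalSales_France'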

Reshaping the Online Retail Store Dataset using unstack():

Now we move on to generating the shape shown earlier in Fig — Desired:

Fig — Desired

We are going to find the products (the Description column) for each invoice generated in the United Kingdom.

Steps involved :

1. Filter the United Kingdom data

2. Group the transactional data at the Invoice and Description level

3. Unstack the data

4. Fill null values with zeroes

5. Set InvoiceNo as the index

We can do all these steps in one go!

Here is the code for this:

(df[df['Country'] =="United Kingdom"]
.groupby(['InvoiceNo', 'Description'])['Quantity'].sum()
.unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))

Output looks like this :

Note :

nunique() gives the number of unique values in a column. For Apriori we only need a 0/1 indicator of whether a product appears in an invoice, not the quantity purchased.
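As a sketch of where this leads (the walkthrough above stops at the reshape), the basket can be binarized and fed to an Apriori implementation; mlxtend is one commonly used library, and the min_support value below is an arbitrary illustration:

basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity'].sum()
          .unstack().fillna(0))
basket_sets = (basket > 0).astype(int)  # Apriori only needs presence/absence, not quantity

from mlxtend.frequent_patterns import apriori, association_rules  # pip install mlxtend
frequent_itemsets = apriori(basket_sets, min_support=0.02, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)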

But the purpose of this discussion is not just unstacking a small dataset. There is a second problem that analysts and data scientists commonly face while generating association rules: if the dataset is too big, the reshape takes a lot of time and memory, and sometimes pandas cannot handle the sheer shape of the transformed dataset at all, throwing an error like the one below:

ValueError: Unstacked DataFrame is too big, causing int32 overflow
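The overflow happens because the reshaped frame needs rows × columns cells, and the affected pandas versions compute that product as a 32-bit integer. You can estimate the danger in advance with a rough pre-check like this sketch:

grouped = df.groupby(['InvoiceNo', 'Description'])['Quantity'].sum()
n_rows = grouped.index.get_level_values('InvoiceNo').nunique()
n_cols = grouped.index.get_level_values('Description').nunique()
if n_rows * n_cols > 2**31 - 1:
    print("unstack() would overflow int32 -- chunk the data instead")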

But real-life business problems involve huge datasets!

What can we do in such cases?

After trying out different approaches, I found one method that can be implemented without disturbing the flow of our pipeline.

Solution : split the data into chunks, unstack each chunk, and combine the results at the end.

But unstacking at the chunk level introduces some problems of its own.

Let's take a simple case with dummy data.

Follow the steps below:

dic = { 'Invoice' : [], 'Product' : [] }
dic['Invoice'].append(1)
dic['Product'].append(['A','B','C','D','E','F','G'])
dic['Invoice'].append(2)
dic['Product'].append(['A','C','E','F','H','I'])
dic['Invoice'].append(3)
dic['Product'].append(['A','B','C','D'])
dic['Invoice'].append(4)
dic['Product'].append(['E','F','G','H','I'])
df_Sample = pd.DataFrame.from_dict(dic)
df_Sample.head()

I added the data in this format because it also lets us see a bit of how stack works here.

We need to convert each element of the Product lists into its own transaction row, replicating the Invoice value (seven rows for Invoice 1, and so on).

This can be achieved by the following step-by-step process:

Step 1 : Convert the list of products into individual columns

df_Sample['Product'].apply(lambda x: pd.Series(x))

Step 2 : Use the stack method in Pandas, which is the opposite of unstack; we can see what it does by stacking the DataFrame above

df_Sample['Product'].apply(lambda x: pd.Series(x)).stack()

Step 3 : Drop the inner index (level 1) that stack created and convert the resulting Series to a DataFrame

df_Sample['Product'].apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True).to_frame('Product')

Step 4 : Left join this with the Invoice column to get the invoices replicated across each product

df_Sample['Product'].apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True).to_frame('Product').join(df_Sample[['Invoice']], how='left')

Now that we have seen the process in detail, we can convert the Data Frame to this shape in one go; the code is below:

df_Sample = (df_Sample['Product'].apply(lambda x: pd.Series(x))
.stack()
.reset_index(level=1, drop=True)
.to_frame('Product')
.join(df_Sample[['Invoice']], how='left')
)
df_Sample = df_Sample[['Invoice','Product']]
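As an aside, recent pandas versions (0.25 and later) offer explode(), which performs this whole list-to-rows reshape in one call; the stack-based chain above is essentially what it does under the hood. On the original list-valued frame it would be:

df_lists = pd.DataFrame.from_dict(dic)               # the original list-valued frame
df_lists.explode('Product')[['Invoice', 'Product']]  # one row per (Invoice, Product)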

Now we come back to our actual problem, where the dataset is too huge: we need to unstack it in chunks and then combine them.

Let's first look at the full (unchunked) dataset unstacked:

df_Sample.reset_index().groupby(['Invoice','Product'])['index'].count().unstack().fillna(0)
Fig5

Before getting into the actual problem we are going to face, let's see how to create the chunks of data:

#Code for getting the chunks of data
Chunks = 2
n_size = round(df_Sample.shape[0]/Chunks)
tot_size = df_Sample.shape[0]
print(n_size, tot_size)

Note : For simplicity we are making two subsets of data

So let's unstack the two subsets and see how they look:

#Subset1
df_Sample.iloc[0:11].reset_index().groupby(['Invoice','Product'])['index'].count().unstack().fillna(0)
Fig6
#Subset2
df_Sample.iloc[11:22].reset_index().groupby(['Invoice','Product'])['index'].count().unstack().fillna(0)
Fig7

Observing the two outputs in Fig6 and Fig7, we find different sets of columns in each, and an invoice whose rows straddle the chunk boundary (Invoice 2 here) shows up in both subsets with different values for its Invoice-Product combinations.

We can solve this problem with the concat method; here is the code:

#The first chunk
df_subset1 = df_Sample.iloc[0:11].reset_index().groupby(['Invoice','Product'])['index'].count().unstack().fillna(0)
#The second chunk
df_subset2 = df_Sample.iloc[11:22].reset_index().groupby(['Invoice','Product'])['index'].count().unstack().fillna(0)
pd.concat([df_subset1, df_subset2], axis=0).fillna(0).groupby('Invoice').max().reset_index()
Fig8

If you compare Fig5 and Fig8, they are the same.
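Rather than eyeballing the figures, we can verify the match programmatically; here is a quick sketch (sorting the columns first, since the chunked result's column order can differ):

full = (df_Sample.reset_index()
        .groupby(['Invoice', 'Product'])['index'].count()
        .unstack().fillna(0))
chunked = (pd.concat([df_subset1, df_subset2], axis=0)
           .fillna(0).groupby('Invoice').max())
print(full.sort_index(axis=1).equals(chunked.sort_index(axis=1)))  # True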

Now we automate the whole process; the complete code is below:

dic = { 'Invoice' : [], 'Product' : [] }
dic['Invoice'].append(1)
dic['Product'].append(['A','B','C','D','E','F','G'])
dic['Invoice'].append(2)
dic['Product'].append(['A','C','E','F','H','I'])
dic['Invoice'].append(3)
dic['Product'].append(['A','B','C','D'])
dic['Invoice'].append(4)
dic['Product'].append(['E','F','G','H','I'])
df_Sample = pd.DataFrame.from_dict(dic)

#Convert the dataset to the required one-row-per-product format
df_Sample = (df_Sample['Product'].apply(lambda x: pd.Series(x))
.stack()
.reset_index(level=1, drop=True)
.to_frame('Product')
.join(df_Sample[['Invoice']], how='left')
)
df_Sample = df_Sample[['Invoice','Product']]

#Code for getting the chunks of data
import math
Chunks = 2
n_size = math.ceil(df_Sample.shape[0]/Chunks) #ceil so a trailing partial chunk is never dropped
tot_size = df_Sample.shape[0]
print(n_size, tot_size)

#Making the list of (lower, upper) row bounds for the different chunks
shapeofChunks = []
lower_bound = 0
for i in range(1, Chunks+1):
    upper_bound = lower_bound + n_size - 1
    shapeofChunks.append((lower_bound, upper_bound))
    lower_bound = n_size + lower_bound

    if lower_bound >= tot_size:
        #The last chunk overshoots, so clip it to the final row
        upper_bound = tot_size - 1
        shapeofChunks.pop() # Try to remove this and see what happens on your own
        shapeofChunks.append((lower_bound - n_size, upper_bound))

#Append the DataFrame subsets to a list
lists_df = []
for i in shapeofChunks:
    lists_df.append(df_Sample.iloc[i[0]:i[1]+1])

#Combine the list of unstacked subsets into one combined frame
df_Output = pd.DataFrame()
for df_chunk in lists_df:
    df_basket = df_chunk.reset_index().groupby(['Invoice','Product'])['index'].count().unstack().fillna(0)
    df_Output = pd.concat([df_basket, df_Output], axis=0).fillna(0).groupby('Invoice').max()
df_Output

Here, we build a list of Data Frame subsets, loop through them, and obtain the final aggregated, unstacked data by following the process above.

Note : This can be very helpful when you work with a huge dataset; the same code can be reused for any big dataset.
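One more memory lever worth knowing (a sketch, assuming a reasonably recent pandas): the basket matrix is mostly zeros, so converting it to a sparse dtype can shrink it dramatically before further processing:

df_sparse = df_Output.astype(pd.SparseDtype("float", 0.0))
print(df_Output.memory_usage(deep=True).sum(), df_sparse.memory_usage(deep=True).sum())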

Conclusion:

So, we discussed the unstack method in pandas and the use cases where it frequently appears, along with a scenario in which unstack becomes difficult to use and a solution for that scenario.

We also saw a use case for stack in pandas, and finally we automated the entire process of reshaping a large dataset from transaction format to basket format.

I hope you gained some knowledge from this discussion, and that it helps you achieve better results when you work on a similar problem!