* This notebook applies the Apriori algorithm to generate frequent itemsets and explains how it works.
For a practical illustration, I use the Grocery Store Data Set.
This dataset contains 11 items, listed here as they are spelled in the file: JAM, MAGGI, SUGER, COFFEE, COCK, TEA, BOURNVITA, CORNFLAKES, BREAD, BISCUIT and MILK.
!pip install -U scikit-learn
!pip install mlxtend
## Necessary libraries imported
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv(r'C:\Users\Arda\Downloads\GroceryStoreDataSet.csv',names=["products"],header=None)
data.head(10)
| | products |
|---|---|
| 0 | MILK,BREAD,BISCUIT |
| 1 | BREAD,MILK,BISCUIT,CORNFLAKES |
| 2 | BREAD,TEA,BOURNVITA |
| 3 | JAM,MAGGI,BREAD,MILK |
| 4 | MAGGI,TEA,BISCUIT |
| 5 | BREAD,TEA,BOURNVITA |
| 6 | MAGGI,TEA,CORNFLAKES |
| 7 | MAGGI,BREAD,TEA,BISCUIT |
| 8 | JAM,MAGGI,BREAD,TEA |
| 9 | BREAD,MILK |
data.values
array([['MILK,BREAD,BISCUIT'],
['BREAD,MILK,BISCUIT,CORNFLAKES'],
['BREAD,TEA,BOURNVITA'],
['JAM,MAGGI,BREAD,MILK'],
['MAGGI,TEA,BISCUIT'],
['BREAD,TEA,BOURNVITA'],
['MAGGI,TEA,CORNFLAKES'],
['MAGGI,BREAD,TEA,BISCUIT'],
['JAM,MAGGI,BREAD,TEA'],
['BREAD,MILK'],
['COFFEE,COCK,BISCUIT,CORNFLAKES'],
['COFFEE,COCK,BISCUIT,CORNFLAKES'],
['COFFEE,SUGER,BOURNVITA'],
['BREAD,COFFEE,COCK'],
['BREAD,SUGER,BISCUIT'],
['COFFEE,SUGER,CORNFLAKES'],
['BREAD,SUGER,BOURNVITA'],
['BREAD,COFFEE,SUGER'],
['BREAD,COFFEE,SUGER'],
['TEA,MILK,COFFEE,CORNFLAKES']], dtype=object)
Apriori is a popular algorithm for extracting frequent itemsets, with applications in association rule learning. It is designed to operate on databases of transactions, such as customer purchases in a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is a set of items that occur together in at least 50% of all transactions in the database. The algorithm works level by level: it first finds the frequent single items, then builds larger candidate itemsets only from smaller ones that are already frequent, because an itemset that contains an infrequent subset cannot itself be frequent.
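To make the definition of support concrete, it can be computed by hand as the fraction of transactions that contain every item of an itemset. The sketch below is a minimal illustration, assuming the 20 transactions printed by `data.values` above and treating each comma-separated product as one item. The helper name `support` is hypothetical and is not part of mlxtend.

```python
# The 20 transactions from the grocery data set, copied from data.values.
raw = [
    "MILK,BREAD,BISCUIT",
    "BREAD,MILK,BISCUIT,CORNFLAKES",
    "BREAD,TEA,BOURNVITA",
    "JAM,MAGGI,BREAD,MILK",
    "MAGGI,TEA,BISCUIT",
    "BREAD,TEA,BOURNVITA",
    "MAGGI,TEA,CORNFLAKES",
    "MAGGI,BREAD,TEA,BISCUIT",
    "JAM,MAGGI,BREAD,TEA",
    "BREAD,MILK",
    "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,SUGER,BOURNVITA",
    "BREAD,COFFEE,COCK",
    "BREAD,SUGER,BISCUIT",
    "COFFEE,SUGER,CORNFLAKES",
    "BREAD,SUGER,BOURNVITA",
    "BREAD,COFFEE,SUGER",
    "BREAD,COFFEE,SUGER",
    "TEA,MILK,COFFEE,CORNFLAKES",
]

# Split each comma-separated string into a set of items.
transactions = [set(row.split(",")) for row in raw]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset` (hypothetical helper)."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"BREAD"}, transactions))          # BREAD appears in 13 of 20 transactions
print(support({"BREAD", "MILK"}, transactions))  # both appear together in 4 of 20
```

With a threshold of 0.5, `{"BREAD"}` would count as frequent under this definition, while `{"BREAD", "MILK"}` would not.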
## We can transform it into the right format via the TransactionEncoder as follows:
## Each row is one comma-separated string, so split it into a list of items first.
## TransactionEncoder expects a list of transactions, each an iterable of items;
## if it is given raw strings, it treats every character as a separate item.
transactions = data["products"].str.split(",").tolist()
transact = TransactionEncoder()
te_data = transact.fit(transactions).transform(transactions)
transact.columns_
df = pd.DataFrame(te_data, columns=transact.columns_)
df
| | BISCUIT | BOURNVITA | BREAD | COCK | COFFEE | CORNFLAKES | JAM | MAGGI | MILK | SUGER | TEA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | True | False | False | False | False | False | True | False | False |
| 1 | True | False | True | False | False | True | False | False | True | False | False |
| 2 | False | True | True | False | False | False | False | False | False | False | True |
| 3 | False | False | True | False | False | False | True | True | True | False | False |
| 4 | True | False | False | False | False | False | False | True | False | False | True |
| 5 | False | True | True | False | False | False | False | False | False | False | True |
| 6 | False | False | False | False | False | True | False | True | False | False | True |
| 7 | True | False | True | False | False | False | False | True | False | False | True |
| 8 | False | False | True | False | False | False | True | True | False | False | True |
| 9 | False | False | True | False | False | False | False | False | True | False | False |
| 10 | True | False | False | True | True | True | False | False | False | False | False |
| 11 | True | False | False | True | True | True | False | False | False | False | False |
| 12 | False | True | False | False | True | False | False | False | False | True | False |
| 13 | False | False | True | True | True | False | False | False | False | False | False |
| 14 | True | False | True | False | False | False | False | False | False | True | False |
| 15 | False | False | False | False | True | True | False | False | False | True | False |
| 16 | False | True | True | False | False | False | False | False | False | True | False |
| 17 | False | False | True | False | True | False | False | False | False | True | False |
| 18 | False | False | True | False | True | False | False | False | False | True | False |
| 19 | False | False | False | False | True | True | False | False | True | False | True |
a = apriori(df, min_support=0.38, use_colnames=True)
a.sort_values(ascending=False, axis=0, by='support')
| | support | itemsets |
|---|---|---|
| 0 | 0.65 | (BREAD) |
| 1 | 0.40 | (COFFEE) |
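To show what the level-wise search does internally, here is a minimal pure-Python sketch of the Apriori idea: find frequent single items, join them into larger candidates, prune candidates that have an infrequent subset, and repeat. It assumes the 20 transactions printed by `data.values`, with each comma-separated product treated as one item. The function name `apriori_sketch` is hypothetical; this is an illustration of the technique, not how mlxtend implements `apriori`.

```python
from itertools import combinations

# The 20 transactions from the grocery data set, copied from data.values.
raw = [
    "MILK,BREAD,BISCUIT", "BREAD,MILK,BISCUIT,CORNFLAKES", "BREAD,TEA,BOURNVITA",
    "JAM,MAGGI,BREAD,MILK", "MAGGI,TEA,BISCUIT", "BREAD,TEA,BOURNVITA",
    "MAGGI,TEA,CORNFLAKES", "MAGGI,BREAD,TEA,BISCUIT", "JAM,MAGGI,BREAD,TEA",
    "BREAD,MILK", "COFFEE,COCK,BISCUIT,CORNFLAKES", "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,SUGER,BOURNVITA", "BREAD,COFFEE,COCK", "BREAD,SUGER,BISCUIT",
    "COFFEE,SUGER,CORNFLAKES", "BREAD,SUGER,BOURNVITA", "BREAD,COFFEE,SUGER",
    "BREAD,COFFEE,SUGER", "TEA,MILK,COFFEE,CORNFLAKES",
]
transactions = [frozenset(row.split(",")) for row in raw]


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support (hypothetical helper)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset,
        # then keep only those that meet the support threshold themselves.
        current = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


result = apriori_sketch(transactions, min_support=0.2)
# Print itemsets from highest to lowest support.
for itemset, sup in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(sup, 2))
```

The prune step is what keeps the search tractable: a candidate is only counted against the transactions if all of its smaller subsets already passed the threshold.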