* Trying the Apriori algorithm to generate frequent itemsets, and explaining how it works.
For a more practical explanation, I am going to use the Grocery Store Data Set.
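This dataset contains 11 items: JAM, MAGGI, SUGAR, COFFEE, CHEESE, TEA, BOURNVITA, CORNFLAKES, BREAD, BISCUIT and MILK. In the raw file two of these labels differ: SUGAR is spelled SUGER, and the item listed here as CHEESE appears to correspond to the label COCK.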
!pip install -U scikit-learn
!pip install mlxtend
## Necessary libraries imported
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv(r'C:\Users\Arda\Downloads\GroceryStoreDataSet.csv',names=["products"],header=None)
data.head(10)
| | products |
---|---|
0 | MILK,BREAD,BISCUIT |
1 | BREAD,MILK,BISCUIT,CORNFLAKES |
2 | BREAD,TEA,BOURNVITA |
3 | JAM,MAGGI,BREAD,MILK |
4 | MAGGI,TEA,BISCUIT |
5 | BREAD,TEA,BOURNVITA |
6 | MAGGI,TEA,CORNFLAKES |
7 | MAGGI,BREAD,TEA,BISCUIT |
8 | JAM,MAGGI,BREAD,TEA |
9 | BREAD,MILK |
data.values
array([['MILK,BREAD,BISCUIT'],
       ['BREAD,MILK,BISCUIT,CORNFLAKES'],
       ['BREAD,TEA,BOURNVITA'],
       ['JAM,MAGGI,BREAD,MILK'],
       ['MAGGI,TEA,BISCUIT'],
       ['BREAD,TEA,BOURNVITA'],
       ['MAGGI,TEA,CORNFLAKES'],
       ['MAGGI,BREAD,TEA,BISCUIT'],
       ['JAM,MAGGI,BREAD,TEA'],
       ['BREAD,MILK'],
       ['COFFEE,COCK,BISCUIT,CORNFLAKES'],
       ['COFFEE,COCK,BISCUIT,CORNFLAKES'],
       ['COFFEE,SUGER,BOURNVITA'],
       ['BREAD,COFFEE,COCK'],
       ['BREAD,SUGER,BISCUIT'],
       ['COFFEE,SUGER,CORNFLAKES'],
       ['BREAD,SUGER,BOURNVITA'],
       ['BREAD,COFFEE,SUGER'],
       ['BREAD,COFFEE,SUGER'],
       ['TEA,MILK,COFFEE,CORNFLAKES']], dtype=object)
Apriori is a popular algorithm for extracting frequent itemsets, with applications in association rule learning. It was designed to operate on databases of transactions, such as purchases made by the customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold, where the support of an itemset is the fraction of transactions that contain all of its items. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is a set of items that occur together in at least 50% of all transactions in the database.
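To make the support definition concrete, here is a minimal sketch that computes support by hand. It assumes the first five baskets shown in `data.head(10)`, already split into sets of item names; the `support` helper is a hypothetical name used only for illustration and is not part of mlxtend.

```python
# First five baskets from the dataset, split into item sets (assumed preprocessing)
transactions = [
    {"MILK", "BREAD", "BISCUIT"},
    {"BREAD", "MILK", "BISCUIT", "CORNFLAKES"},
    {"BREAD", "TEA", "BOURNVITA"},
    {"JAM", "MAGGI", "BREAD", "MILK"},
    {"MAGGI", "TEA", "BISCUIT"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


bread_milk = support({"BREAD", "MILK"}, transactions)  # 3 of 5 baskets
tea = support({"TEA"}, transactions)                   # 2 of 5 baskets

min_support = 0.5
print(bread_milk, bread_milk >= min_support)  # meets the 50% threshold
print(tea, tea >= min_support)                # falls below the 50% threshold
```

On these five baskets, {BREAD, MILK} would count as frequent at a 50% threshold, while {TEA} would not.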
## We can transform it into the right format via the TransactionEncoder as follows:
transact=TransactionEncoder()
te_data=transact.fit(data).transform(data)
transact.columns_
df=pd.DataFrame(te_data,columns=transact.columns_)
df
| | A | B | C | D | E | F | G | I | J | K | L | M | N | O | R | S | T | U | V |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | True | False | True | True | True | False | False | False | False | False | False | False |
1 | True | True | False | True | True | False | False | False | False | False | False | False | False | False | True | False | False | False | False |
2 | False | True | True | False | False | False | False | True | False | False | False | False | False | False | False | True | True | True | False |
3 | True | True | False | True | True | False | False | False | False | False | False | False | False | False | True | False | False | False | False |
4 | False | False | False | False | False | False | False | True | False | True | True | True | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
61 | False | False | False | False | True | False | True | False | False | False | False | False | False | False | True | True | False | True | False |
62 | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False |
63 | False | False | False | False | False | False | False | True | False | True | True | True | False | False | False | False | False | False | False |
64 | False | False | True | False | True | True | False | False | False | False | False | False | False | True | False | False | False | False | False |
65 | True | False | True | False | True | True | False | False | False | True | True | False | True | True | True | True | False | False | False |
66 rows × 19 columns
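Note that the columns of this frame are single letters rather than product names: row 0 has only I, K, L and M set, which spells MILK. That suggests each item name was encoded as a sequence of characters instead of each basket being encoded as a list of items. The sketch below shows the difference, assuming every row of `products` is a comma-separated string. The one-hot step is a hand-rolled stand-in for what an encoder produces, and the variable names are illustrative.

```python
# First five rows as they appear in the `products` column
raw_rows = [
    "MILK,BREAD,BISCUIT",
    "BREAD,MILK,BISCUIT,CORNFLAKES",
    "BREAD,TEA,BOURNVITA",
    "JAM,MAGGI,BREAD,MILK",
    "MAGGI,TEA,BISCUIT",
]

# Iterating over a string yields its characters, which is how letter columns arise
letters = sorted(set("MILK"))

# Splitting on commas yields one list of item names per basket
transactions = [row.split(",") for row in raw_rows]

# Hand-rolled one-hot encoding: one boolean column per distinct item
columns = sorted({item for basket in transactions for item in basket})
one_hot = [[item in basket for item in columns] for basket in transactions]

print(letters)
print(columns)
for flags in one_hot:
    print(flags)
```

With mlxtend, the same list of lists can be passed to `TransactionEncoder().fit(transactions).transform(transactions)`. Starting from the frame loaded above, something like `data["products"].str.split(",").tolist()` should produce that list, after which the columns would be item names such as BREAD and MILK.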
a = apriori(df, min_support=0.38, use_colnames=True)
a.sort_values(ascending=False, axis=0,by='support')
| | support | itemsets |
---|---|---|
1 | 0.606061 | (E) |
0 | 0.560606 | (A) |
2 | 0.439394 | (R) |
3 | 0.393939 | (E, A) |
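The `apriori` call can be read as a level-wise search: find the frequent single items, join them into larger candidates, prune any candidate that has an infrequent subset, and count support for the rest. The sketch below is a simplified, unoptimized illustration of that idea and not mlxtend's implementation. `apriori_sketch` is a hypothetical name, and the five baskets are the first rows of the dataset split on commas.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = sorted({item for b in baskets for item in b})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count step: keep the candidates that meet the threshold
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["JAM", "MAGGI", "BREAD", "MILK"],
    ["MAGGI", "TEA", "BISCUIT"],
]

result = apriori_sketch(transactions, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are, so candidates with an infrequent subset are discarded before their support is counted.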