Association Rule Mining : Apriori and Eclat Algorithm Implementation in R Language


Introduction

Data Mining, otherwise known as Knowledge Discovery in Databases (KDD), refers to the process of discovering interesting patterns, trends, anomalies, and other valuable knowledge from a large repository of data. Data mining uses statistical and mathematical techniques to extract useful patterns and knowledge that exist in data. The different data sources include databases, data warehouses, information repositories, the web, etc.  There are various methods used for Data Mining such as association rules, clustering, classification, prediction, sequential patterns, and regression. 

This article focuses primarily on Apriori and Eclat algorithms, a famous data mining algorithm to generate association rules. We will first begin with the definition of association rule mining, and then understand the concepts of the apriori and eclat algorithms, its R implementations, and finally end with the advantages and disadvantages of the Apriori algorithm.

Assocation Rule Mining

Association rule mining is a rule-based data mining technique to identify associations, correlations, and interesting patterns in large databases. The main idea is to find relationships between the data in the dataset and to discover rules.

The most common and popular application of association rule mining is market basket analysis, where the database containing customer transactions is mined to discover the buying patterns. The generated association rules define customers' behaviour in buying items, i.e., it identifies the relationship between the different items that customers often buy together.

Association Rules

Association rule mining is a conventional and rule-based method for determining interesting relations between attributes in large databases. It is intended to discover all hidden associations or patterns that satisfy some user-predefined criteria. 

Association rules are simple If/Then statements that help discover relationships between the data in the data set. An association rule will be of the form {A => B}, where A and B define a set of attributes referred to as Antecedent and Consequent, respectively. 

Consider a below transaction dataset
Market basket analysis
Example: If{Bread, Butter} => Then{Jam}

Antecedent: The ‘If’ part is referred to as antecedent which is an item or set of items that is found in data. It forms the Left Hand Side (LHS) of the rule.

Consequent: The 'Then' part is referred to as consequent which is an item or set of items that is likely to found in combination with the antecedent. It forms the Right Hand Side (RHS) of the rule.

The above example states that, if Bread and Butter are bought then, the customers are also likely to buy Jam. Also, both the antecedent and consequent are disjoint in nature. The following measures are used to select potentially relevant rules in all possible rules.

Rule Evaluation Metrics

Support and Minimum Support 

The support for a rule (A=>B) indicates how frequently the antecedent and the consequent (If/Then) appear together in the dataset.

The minimum support is a user-specified threshold value that acts as a minimum criterion that will be used to exclude or filter rules in the result that have a support value less than the given minimum support.

 Mathematically, we can define support as:

Support metric in association rule mining

Confidence and Minimum Confidence : 

The Confidence for a rule (A=>B) indicates the number of times a given rule is found to be true. It measures the reliability of the inference made by an association rule. The values of confidence range between 0 and 1.

The minimum confidence is a user-specified threshold value that acts as a minimum criterion that will be used to prune or filter uninteresting rules in the result that have a confidence value less than the given minimum confidence.

Mathematically, we can define support as:

Confidence metric in association rule mining

The association rule mining can be viewed as a two step process

  1. Find Frequent Itemsets : An Itemset is said to be frequent if it satisfies the predetermined minimum support threshold value. That is, Itemsets whose support value is greater than or equal to the minimum support.
  2. Generate strong association rules from the frequent itemsets : Association rules are said to be strong if it complies with both the minimum threshold values of support and confidence.

Types of Association Rule Mining Algorithm

Association rule mining can be implemented using three algorithm. They are

  1. Apriori
  2. Eclat
  3. FP - Growth
In this article, I will show you how to implement the Apriori and Eclat algorithm using R programming language in RStudio.


Apriori

The Apriori algorithm is a widely used algorithm for mining frequent itemsets and discovering potential association rules from the given dataset. It is considered an influential unsupervised learning approach since it is used to mine hidden patterns or associations in the given dataset without any target variable for the model. 

Apriori algorithm employs breadth first search strategy to discover frequent itemsets. It is based on an important property that states that 

All non-empty subsets of a frequent itemset must also be frequent. In other words, if an itemset is infrequent, then all its superset are also infrequent.


How Apriori works

  1. Find all Frequent Itemsets
    1. Join step : This step is also known as candidate generation which generates (K+1) itemsets from K- Itemsets by joining with itself. For example, 1-itemset {{a},{b},{c}} is made to self join with itself to form 2-itemset {a,b},{a,c},{b,c}.
    2. Prune step : During this step, the Antimonotone property is applied to reduce the search space of candidate itemsets. As per the antimonotone property, if the support value of an itemset is less than the minimum support threshold, then all of its supersets will also be less than the minimum support. Thus, it can be pruned.
  2. Generate strong association rules : Using the frequent itemsets, select rules which satisfies minimum support and minimum confidence threshold. Such rules are called strong rules.

Eclat 

Eclat stands for Equivalence Class Clustering and bottom-up Lattice Traversal, and it is a popular algorithm for mining frequent if-then patterns in the dataset. Eclat algorithm uses a Depth First Search approach to discover rules, which is usually faster than a Breadth First Search. The use of the DFS strategy consumes less memory than Apriori. 

Eclat algorithm can perform mining only on vertical transactional databases. Even if the database is in horizontal format, we need to transform it into vertical format for mining. This vertical approach makes Eclat execute faster compared to Apriori, which uses a horizontal data format for generating frequent itemsets. Another advantage of using vertical dataset is that it avoids repeated scanning of the dataset to calculate the support value. Eclat only deals with support metric for mining itemsets.

Apriori Algorithm Implementation in R Language

In this tutorial, we will use  RStudio IDE to implement association rule mining Apriori algorithm. 


Dataset Description

The dataset used here is the "groceries" dataset which is available as pre-loaded in R. Run data() command from the terminal to view list of built-in datasets. 

data() R datasets

Required Libraries

The package "arules" in R provides infrastructure for analyzing transactional data using frequent itemsets and association rules. Similarly, the other packages like "arulesviz", "coefplot", "ggplot2" and "plotly" are used to visualize the generated rules in a graphical way. To Install and load the packages, run the following code fron the terminal.


# To install packages
install.package("arules")
install.package("arulesviz")
install.package("plotly")
install.package("ggplot")
install.package("coefplot")

# To load installed packages
library("arules")
library("arulesViz")
library("plotly")
library("ggplot2")
library("coefplot")

# To import "Groceries" dataset
data("Groceries")

Apply apriori() function

The function "apriori()"is an in-built function in R which takes transactional dataset and parameters list as input and generates association rules as output. 

Parameter list
1. Support (supp) : Minimum support threshold.
2. Confidence (conf) : Minimum confidence thereshold.
3. Minimum length (minlen) : Minimum length of the rules.
4. Maximum length (maxlen) : Maximum length of the rules.



# Run apriori function and store the rules in "myrules"
myrules <- apriori(Groceries,parameter = list(supp = 0.03,conf = 0.1,minlen=2))

# To inspect first 10 rules
inspect(myrules[1:10])

# To sort rules based on lift value
rules=sort(myrules, by="lift")

# To summarize
summary(rules)


apriori algorithm to mine rules

inspect first 10 rules

Apply subset() function

The function "subset()" is used to filter rules that matches with the expression given in the subset.

# To select only rules with rhs = "Whole milk"
srules=(subset(myrules,subset=rhs %in% "whole milk"));
inspect(srules)


find subset of rules

Visualize the rules

# Visual representation of rules
plot(srules,jitter=0)
plot(srules,method="two-key plot")
plot(srules,method="graph", engine = "htmlwidget")
plot(srules,engine="plotly")


Eclat Algorithm Implementation in R Language

The Eclat algorithm is found in the package "arules" of R system. Run the following code to install and load packages.

# To install packages
install.package("arules")
install.package("arulesviz")
install.package("plotly")
install.package("ggplot")
install.package("coefplot")

# To load installed packages
library("arules")
library("arulesViz")
library("plotly")
library("ggplot2")
library("coefplot")

# To import "Groceries" dataset
data("Groceries")


Apply eclat() function

The eclat() function returns an object of class itemsets as output. This algorithm uses simple intersection operations for equivalence class clustering along with bottom-up lattice traversal.

# Use eclat to find itemsets
itemsets <- eclat (Groceries, parameter = list(supp = 0.03, tidLists = TRUE, minlen = 2))



eclat algorithm to find itemsets


Apply ruleInduction() function

The ruleInduction() function can be used to generate association rules from the found itemsets. 

# ruleInduction to generate rules from found itemsets
rules <- ruleInduction(itemsets,transactions=Groceries, confidence = 0.1,reduce = FALSE, verbose = FALSE)

# To view generated rules
inspect(rules)

To select only rules with rhs = "Whole milk"
srules=(subset(rules,subset=rhs %in% "whole milk"));
inspect(srules)



ruleInduction to generate rules from found itemsets

Subset of rules

Visualize the rules

# Visual representation of rules
plot(srules,jitter=0)
plot(srules,method="two-key plot")
plot(srules,method="graph", engine = "htmlwidget")
plot(srules,engine="plotly")

Post a Comment

0 Comments