This article explain the basics of association rules and how to find them using Apriori algorithm. The algorithm is implemented in C# and Silverlight and a live demonstration is available below with full source code.
Apriori Algorithm Demo in Silverlight
The Apriori algorithm for finding large itemsets and generating association rules using those large itemsets are illustrated in this demo. The source code of the demo is available for download below.
What is Data Mining and Association Rules
Data Mining is a technique for discovering useful information from large databases. Analyzing the data and extracting useful information can be potentially very profitable to a business. For example, if a seller can find the association between two products, critical decisions in pricing or product placements can be made in order to promote the business. By this way, the seller can concentrate the marketing efforts on every subset of customers who are very likely to buy the associated products.
Association Rules are used for discovering regularities between products in big transactional databases. A transaction is an event involving one or more of the products (items) in the business or domain; for example buying of goods by a consumer in a super market is a transaction. A set of items is usually referred as "itemset", and an itemset with "k" number of items is called "k-itemset".
The general form of an association rule is X => Y, where X and Y are two disjoint itemsets. The "support" of an itemset is the number of transactions that contain all the items of that itemset; whereas the support of an association rule is the number of transactions that contain all items of both X and Y. The "confidence" of an association rule is the ratio between its support and the support of X.
A given association rule X => Y is considered significant and useful, if it has high support and confidence values. The user will specify a threshold value for support and confidence, so that different degrees of significance can be observed based on these threshold values.
For more information on association rules, see http://en.wikipedia.org/wiki/Association_rule.
Data Structures for this Article
We create the following data-structure classes. An Itemset is just a list of strings; it can be used to represent a transaction or any set of items. The ItemsetCollection is list of itemsets; it can be used to represent a transactional database or any group of itemsets. The AssociationRule class represents an instance of a generated association-rule.
public class Itemset : List<string>
{
public double Support { get; set; }
}
public class ItemsetCollection : List<Itemset>
{
}
public class AssociationRule
{
public Itemset X { get; set; }
public Itemset Y { get; set; }
public double Support { get; set; }
public double Confidence { get; set; }
}
Finding Large Itemsets using Apriori Algorithm
The first step in the generation of association rules is the identification of large itemsets. An itemset is "large" if its support is greater than a threshold, specified by the user. A commonly used algorithm for this purpose is the Apriori algorithm.
The Apriori algorithm relies on the principle "Every non-empty subset of a larget itemset must itself be a large itemset". The algorithm applies this principle in a bottom-up manner. Let Li denote the collection of large itemsets with "i" number of items. The algorithm begins by identifying all the sets in L1. Each item that has the necessary support forms a large 1-itemset and included in L1, other itemsets are dropped from consideration. This process of retaining necessary itemsets only is called "pruning". The set of itemsets used to find Li is called candidate itemsets (Ci).
The collection L2 can be constructed by considering each pair of sets in L1 and retaining only those pairs that has enough support. In general, having constructed Li, the collection Li+1 is constructed by considering pairs of sets, one from Li and another from L1 and eliminating those for which the support is smaller. This procedure is continued until all large itemsets up to the desired maximum size have been obtained or no further pruning is possible.
The following code implements the Apriori algorithm.
public static ItemsetCollection DoApriori(ItemsetCollection db, double supportThreshold)
{
Itemset I = db.GetUniqueItems();
ItemsetCollection L = new ItemsetCollection(); //resultant large itemsets
ItemsetCollection Li = new ItemsetCollection(); //large itemset in each iteration
ItemsetCollection Ci = new ItemsetCollection(); //candidate itemset in each iteration
//first iteration (1-item itemsets)
foreach (string item in I)
{
Ci.Add(new Itemset() { item });
}
//next iterations
int k = 2;
while (Ci.Count != 0)
{
//set Li from Ci (pruning)
Li.Clear();
foreach (Itemset itemset in Ci)
{
itemset.Support = db.FindSupport(itemset);
if (itemset.Support >= supportThreshold)
{
Li.Add(itemset);
L.Add(itemset);
}
}
//set Ci for next iteration (find supersets of Li)
Ci.Clear();
Ci.AddRange(Bit.FindSubsets(Li.GetUniqueItems(), k)); //get k-item subsets
k += 1;
}
return (L);
}
The FindSubsets() function defined in Bit class is used to find all the subsets of a given set of items. This is explained in more detail in this article.
Finding Association Rules
Having found the set of all large itemsets from the input database, the next task is to find the required set of strong association rules. An association rule is "strong" if its confidence value is greater than a user-defined threshold. The association rules are created by combining each large itemset with each of its subsets. The strong rules are published as result and others are dropped.
public static List<AssociationRule> Mine(ItemsetCollection db, ItemsetCollection L, double confidenceThreshold)
{
List<AssociationRule> allRules = new List<AssociationRule>();
foreach (Itemset itemset in L)
{
ItemsetCollection subsets = Bit.FindSubsets(itemset, 0); //get all subsets
foreach (Itemset subset in subsets)
{
double confidence = (db.FindSupport(itemset) / db.FindSupport(subset)) * 100.0;
if (confidence >= confidenceThreshold)
{
AssociationRule rule = new AssociationRule();
rule.X.AddRange(subset);
rule.Y.AddRange(itemset.Remove(subset));
rule.Support = db.FindSupport(itemset);
rule.Confidence = confidence;
if (rule.X.Count > 0 && rule.Y.Count > 0)
{
allRules.Add(rule);
}
}
}
}
return (allRules);
}
Downloads for this article
File |
Language |
Tools |
Apriori-Demo-Source 365.57 kb (2622 downloads) |
C#, Silverlight 3 |
Visual Studio 2010 |