ARtool mines association rules in binary databases.
Binary databases are databases whose attributes can take only two values. For example, in a database of supermarket transactions the attributes could be the items sold by the supermarket, each taking the value present or absent according to whether the item was involved in a transaction. Each row of the database represents an individual transaction. Because the first applications of association rules were to supermarket data, binary attributes are historically called items and rows are also called transactions. If an item is present in a row, we say that the row contains the item.
A set of items is also called an itemset. The support of an itemset is the fraction of the rows of the database that contain all of the items in the itemset.
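As a sketch of this definition (the transaction data and function names here are illustrative, not part of ARtool), support can be computed by counting the rows that contain every item of the itemset:

```python
def support(rows, itemset):
    """Fraction of rows that contain all items of the itemset."""
    itemset = set(itemset)
    matching = sum(1 for row in rows if itemset <= set(row))
    return matching / len(rows)

# Toy transaction database: each row is the set of items it contains.
rows = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(rows, {"bread", "milk"}))  # 2 of 4 rows contain both -> 0.5
```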
An association rule is composed of two parts, the antecedent A
and the consequent C, and is usually denoted A -> C.
An association rule suggests that the presence of the antecedent in a row implies, to some extent, the presence of the consequent. The extent of this implication is measured by two quantities: the support of the rule, which is the support of the itemset AC obtained by joining the antecedent and the consequent, and the confidence, which is support(AC) / support(A), i.e. the fraction of rows containing the antecedent that also contain the consequent.
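As an illustrative sketch (these helper names are my own, not ARtool's API), the two measures of a rule A -> C can be computed like this:

```python
def support(rows, itemset):
    """Fraction of rows that contain all items of the itemset."""
    itemset = set(itemset)
    return sum(1 for row in rows if itemset <= set(row)) / len(rows)

def rule_support(rows, antecedent, consequent):
    # The support of A -> C is the support of the combined itemset AC.
    return support(rows, set(antecedent) | set(consequent))

def confidence(rows, antecedent, consequent):
    # Fraction of rows containing A that also contain C.
    return rule_support(rows, antecedent, consequent) / support(rows, antecedent)

rows = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(confidence(rows, {"bread"}, {"milk"}))  # 2 of the 3 bread rows contain milk
```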
Itemsets that have support greater than the minimum support value are called frequent. Some authors also use the term large, but I prefer to reserve that term for the maximal frequent itemsets.
Association rule mining is usually performed in two steps: first, all frequent itemsets are found; then, rules satisfying the minimum confidence requirement are generated from the frequent itemsets.
The second step normally takes much less time than the first, but it can produce a large number of rules: thousands, tens of thousands, or even more. Browsing through these rules to find useful ones is another mining problem in itself. There are, however, algorithms that generate only rules that are interesting or non-redundant according to some definition of interestingness or redundancy.
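The two steps can be sketched with a deliberately naive miner (real tools like ARtool use far more efficient algorithms; the enumeration here is brute force, and all names are illustrative):

```python
from itertools import combinations

def support(rows, itemset):
    itemset = set(itemset)
    return sum(1 for row in rows if itemset <= set(row)) / len(rows)

def mine_rules(rows, min_sup, min_conf):
    """Toy two-step miner: find frequent itemsets, then derive rules."""
    items = sorted(set().union(*rows))
    # Step 1: brute-force search for frequent itemsets.
    frequent = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(rows, c) >= min_sup
    ]
    # Step 2: split each frequent itemset into antecedent and consequent,
    # keeping only rules that meet the minimum confidence.
    rules = []
    for itemset in frequent:
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                conf = support(rows, itemset) / support(rows, antecedent)
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

rows = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
for a, c, conf in mine_rules(rows, min_sup=0.5, min_conf=0.6):
    print(a, "->", c, round(conf, 2))
```

Even on this four-row toy database the second step yields several rules, which hints at how quickly the rule count grows on real data.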
There are a number of other measures that can be used to assess the value of an association rule. We mention some of them here:
support(AC) - support(A) * support(C)
This measure was introduced by Gregory Piatetsky-Shapiro. If we think of the support of an itemset as approximating the probability of that itemset appearing in a row of the table, then the Piatetsky-Shapiro measure tells us how much the actual occurrence of AC differs from its expected value. Note that this measure is symmetric: it has the same value for A->C and C->A.
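A sketch of this measure, reusing an illustrative support helper (not ARtool's API):

```python
def support(rows, itemset):
    itemset = set(itemset)
    return sum(1 for row in rows if itemset <= set(row)) / len(rows)

def piatetsky_shapiro(rows, antecedent, consequent):
    """Difference between the observed support of AC and the support
    expected if A and C occurred independently."""
    a, c = set(antecedent), set(consequent)
    return support(rows, a | c) - support(rows, a) * support(rows, c)

rows = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
# Symmetric: swapping antecedent and consequent gives the same value.
print(piatetsky_shapiro(rows, {"bread"}, {"milk"}))
print(piatetsky_shapiro(rows, {"milk"}, {"bread"}))
```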
support(AC) / (support(A) * support(C))
The lift is a measure used in statistics that tells us how much the presence of the antecedent influences the appearance of the consequent. Like the Piatetsky-Shapiro measure, it is symmetric.
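A sketch of the lift under the same illustrative helpers (a lift of 1 means the antecedent and consequent appear together exactly as often as independence would predict):

```python
def support(rows, itemset):
    itemset = set(itemset)
    return sum(1 for row in rows if itemset <= set(row)) / len(rows)

def lift(rows, antecedent, consequent):
    """Ratio of the observed support of AC to the support expected
    under independence."""
    a, c = set(antecedent), set(consequent)
    return support(rows, a | c) / (support(rows, a) * support(rows, c))

rows = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(lift(rows, {"bread"}, {"milk"}))  # below 1: a slightly negative association
```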
(support(AC) / support(A)) - support(C)
I came up with this measure while implementing ARtool. The influence is derived from the lift; however, unlike the lift and the Piatetsky-Shapiro measures, it is asymmetric.
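A sketch of the influence measure with the same illustrative helpers; the first term is the confidence of A -> C, so the measure is positive exactly when the antecedent raises the chance of seeing the consequent:

```python
def support(rows, itemset):
    itemset = set(itemset)
    return sum(1 for row in rows if itemset <= set(row)) / len(rows)

def influence(rows, antecedent, consequent):
    """Confidence of A -> C minus the support of C on its own."""
    a, c = set(antecedent), set(consequent)
    return support(rows, a | c) / support(rows, a) - support(rows, c)

rows = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
# Asymmetric: swapping antecedent and consequent changes the value.
print(influence(rows, {"butter"}, {"bread"}))
print(influence(rows, {"bread"}, {"butter"}))
```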