Definitions

ARtool mines association rules in binary databases.

Binary databases are databases whose attributes can take only two values. For example, in a database of supermarket transactions the attributes could be the items sold by the supermarket and the values they can take could be present or absent, according to whether the items were or were not involved in transactions. Each row of the database would represent an individual transaction. Because the first applications of association rules were made on supermarket data, binary attributes are historically called items and rows may also be called transactions. If an item is present in a row we say that the row contains the item.

A set of items is also called itemset. The support of an itemset is the fraction of the rows of the database that contain ALL of the items in the itemset.

An association rule is composed of two parts, the antecedent and the consequent, and is usually denoted as:

antecedent -> consequent.

An association rule suggests that the presence of the antecedent in a row implies to some extent the presence of the consequent. To measure the extent of this implication two measures are used:

  1. the support of the rule gives the fraction of the rows of the database that contain both the antecedent and the consequent of the rule. The support of a rule tells us in how many instances (rows) the rule can be observed.

  2. the confidence of the rule tells us what percentage of the rows that contain the antecedent also contain the consequent of the rule. The confidence of a rule gives us an idea of the strength of the influence that the antecedent has on the presence of the consequent of the rule.
The association rule mining problem consists of finding all association rules existing in a database, having support and confidence greater than a minimum support value and, respectively, a minimum confidence value.

Itemsets that have support greater than the minimum support value are called frequent. Some authors also use the term large but I prefer to use this term to denote the maximal frequent itemsets.

Association rule mining is usually performed in two steps:

  1. finding (all) frequent itemsets
  2. generating (all) association rules starting from the frequent itemsets discovered in the precedent step
The word all is enclosed in parentheses because there are algorithms who are only looking for a representative subset of the set of frequent itemsets (this may be the set of closed itemsets, or the set of maximal frequent itemsets), and also there are algorithms that look only for a subset of association rules (which may consist of "non-redundant" or "interesting" associations). The first step is computationally intensive; depending on the data and the mining parameters this step could take hours, or even days, unless the application happens to run out of memory first.

The second step takes normally much less time than the first but can result in a large number of rules, in the order of thousands, tens of thousands, or even higher. Browsing through these rules and finding useful ones is another mining problem in itself. There are however algorithms who only generate rules that are interesting or non-redundant in the sense of some definition of interestingness or redundancy.

There are a number of other measures that can be used to determine the value of an association rule. We mention a number of them here: