ARtool User's Guide
The graphical user interface
If you have read the Introduction to Association Rules section,
then you should be able to figure out quickly how to use the
interface. In this section you will learn a little more about the
finer details.
Initially you have to choose a database to work on. This is done on
the Database tab. Once you have selected a database, you can
start mining frequent itemsets or association rules from the other two
tabs. The Database tab also gives you information about the
characteristics of the database that you have selected.
ARtool breaks the mining process into two steps: mining the frequent
itemsets and generating the association rules. In what follows we will
refer to these steps as the first and second mining step.
To perform the first mining step, go to the
Frequent Itemsets tab, select an algorithm and a minimum
support value (greater than 0 and less than or equal to 1), and
press Go. If the algorithm takes longer than you
care to wait, press Abort to stop the mining
process. Note that some time may pass between pressing Abort
and the mining process actually stopping.
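One common reason for such a delay is cooperative cancellation: the worker only checks for an abort request at certain points, for example between passes over the data. The sketch below illustrates that idea. It is an assumption about how an abortable thread might behave, not the code of laur.tools.AbortableThread, and every name in it (AbortableWorker, runAndAbort, doOnePass) is hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of cooperative cancellation; not ARtool's AbortableThread.
class AbortableWorker extends Thread {
    private final AtomicBoolean abortRequested = new AtomicBoolean(false);
    private volatile long passesCompleted = 0;
    private final long maxPasses;

    AbortableWorker(long maxPasses) {
        this.maxPasses = maxPasses;
    }

    // Only records the request; the worker notices it at its next check.
    void abort() {
        abortRequested.set(true);
    }

    long passesCompleted() {
        return passesCompleted;
    }

    @Override
    public void run() {
        for (long pass = 0; pass < maxPasses; pass++) {
            if (abortRequested.get()) {
                return; // the flag is checked only between passes
            }
            doOnePass();
            passesCompleted = pass + 1;
        }
    }

    // Stands in for one pass over the database.
    private void doOnePass() {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts a worker, requests an abort after delayMillis, and reports
    // how many passes finished before the worker actually stopped.
    static long runAndAbort(long maxPasses, long delayMillis) {
        AbortableWorker worker = new AbortableWorker(maxPasses);
        worker.start();
        try {
            Thread.sleep(delayMillis);
            worker.abort();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return worker.passesCompleted();
    }

    public static void main(String[] args) {
        System.out.println("passes completed: " + runAndAbort(1000, 50));
    }
}
```

In this sketch the pass that is already running always finishes before the worker stops, which is why the abort is not instantaneous.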
Since the second mining step needs the results of the first step (the
frequent itemsets), these are saved in a cache file, normally
having the same name as the database but with an extension of
.cache. If you want to read the contents of a previously
generated cache file, simply select the Use Cache algorithm on the
Frequent Itemsets tab. This is faster than running a mining
algorithm, which always regenerates the cache.
PITFALLS:
- If you abort the mining process, the resulting cache file is
incomplete, and anything computed from it will be incomplete as well!
- The cache files are not cumulative: if you generate a cache file
for minimum support 0.1, and later regenerate it for 0.5, the cache
file will no longer contain the frequent itemsets with supports in
the interval [0.1, 0.5).
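The second pitfall above can be pictured with a small sketch. It models a cache as a map from itemset to support and "mines" by filtering on the minimum support. The class CacheCoverageDemo, its methods, and the sample data are hypothetical and only illustrate the effect; they are not ARtool's cache format or API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; these names are not part of ARtool.
class CacheCoverageDemo {
    // Keeps only the itemsets whose support reaches the minimum support.
    static Map<String, Double> mine(Map<String, Double> allSupports, double minSupport) {
        if (minSupport <= 0.0 || minSupport > 1.0) {
            throw new IllegalArgumentException("minimum support must be in (0, 1]");
        }
        Map<String, Double> cache = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : allSupports.entrySet()) {
            if (e.getValue() >= minSupport) {
                cache.put(e.getKey(), e.getValue());
            }
        }
        return cache;
    }

    // Made-up supports for a toy database.
    static Map<String, Double> sampleSupports() {
        Map<String, Double> s = new LinkedHashMap<>();
        s.put("{bread}", 0.6);
        s.put("{milk}", 0.5);
        s.put("{bread, milk}", 0.3);
        s.put("{eggs}", 0.1);
        return s;
    }

    public static void main(String[] args) {
        Map<String, Double> all = sampleSupports();
        System.out.println(mine(all, 0.1).keySet()); // all four itemsets
        System.out.println(mine(all, 0.5).keySet()); // only {bread} and {milk}
    }
}
```

After regenerating for 0.5, the itemsets {bread, milk} and {eggs} are gone from the cache, so a later run that expects support 0.1 would not see them.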
For the second mining step, you need to select the minimum support
value on the Frequent Itemsets tab, and then select an
algorithm and a minimum confidence value (a value > 0 and <= 1) on
the Association Rules tab. The Go and Abort
buttons work in the same manner as those from the Frequent
Itemsets tab. The mining algorithm that you selected will start by
reading the cache file built during the first mining step. It is
therefore important to have such a cache file.
PITFALLS:
- If the cache file has not been created, the second mining step
cannot be performed.
- If the cache file is incomplete due to an aborted mining process,
or if it has been generated for a support greater than the one
currently selected by the user, the results of the second mining step
will be incomplete and possibly also incorrect.
- The minimum support value used in the second mining step is the
value from the Frequent Itemsets tab. If that value is invalid,
the second mining step cannot be performed.
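To see why an incomplete cache leads to incomplete rules, here is a minimal sketch of rule generation from cached itemsets. It uses the standard definition confidence(X -> Y) = support(X union Y) / support(X). The class RuleSketch, its rules method, and the sample supports are hypothetical; this is not how ARtool's AssociationsMiner subclasses are implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not ARtool's AssociationsMiner or cache format.
class RuleSketch {
    // Generates rules X -> Y from cached frequent itemsets, where
    // confidence(X -> Y) = support(X union Y) / support(X).
    static List<String> rules(Map<Set<String>, Double> cache,
                              double minSupport, double minConfidence) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Set<String>, Double> e : cache.entrySet()) {
            List<String> items = new ArrayList<>(new TreeSet<>(e.getKey()));
            if (items.size() < 2 || e.getValue() < minSupport) {
                continue;
            }
            // Enumerate every non-empty proper subset as an antecedent.
            for (int mask = 1; mask < (1 << items.size()) - 1; mask++) {
                Set<String> antecedent = new TreeSet<>();
                Set<String> consequent = new TreeSet<>();
                for (int i = 0; i < items.size(); i++) {
                    if ((mask & (1 << i)) != 0) {
                        antecedent.add(items.get(i));
                    } else {
                        consequent.add(items.get(i));
                    }
                }
                Double antecedentSupport = cache.get(antecedent);
                if (antecedentSupport == null) {
                    continue; // antecedent missing from the cache: rule is lost
                }
                double confidence = e.getValue() / antecedentSupport;
                if (confidence >= minConfidence) {
                    out.add(antecedent + " -> " + consequent);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Set<String>, Double> cache = new HashMap<>();
        cache.put(Set.of("bread"), 0.6);
        cache.put(Set.of("milk"), 0.5);
        cache.put(Set.of("bread", "milk"), 0.3);
        System.out.println(rules(cache, 0.1, 0.55));
        cache.remove(Set.of("milk")); // simulate an incomplete cache
        System.out.println(rules(cache, 0.1, 0.55));
    }
}
```

With the complete cache the sketch finds the rule from milk to bread; once the {milk} entry is missing, the same call finds nothing, with no error to signal the loss.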
The log window at the bottom of the ARtool frame displays status and
error messages, so it is useful to always keep an eye on it. You can
clear the log window if you wish by selecting the Program/Clear
log menu item.
The results of the two mining steps are presented in tables, which
can be sorted in ascending or descending order on any column by
double-clicking its header. The columns containing itemsets
are sorted according to the size of the itemset. You can also
double-click on a table row to display its contents in the log window,
which is useful when the itemset is too large to be displayed entirely
in the table.
You can evaluate the discovered rules using various measures through
the Program/Compute measure menu item.
If you want to free some memory, then you can clear the result tables
by selecting either Discard itemsets or Discard rules
from the Program menu.
The Generate a synthetic database menu item allows you to
create a synthetic database. For more information, see
reference [1a]. Note that if you generate a database with the
same name as an existing database, the existing database will be
overwritten.
The command line tools
The command line tools are easy to use: run any of them
with no parameters to get usage instructions.
Below is a list of the command line tools along with a brief
description of each of them:
- minedb - mines association rules in a database.
- gendb - generates a database, useful for seeing how an algorithm
behaves on databases of various parameters.
- dbtool - used to perform various operations on a database, like
checking its integrity, setting a new description, displaying
information, etc.
- diffcache - performs a comparison between two cache files. It's
useful when debugging a mining algorithm, to see whether it produces
the same output as an algorithm you trust.
- db2asc - converts a .db file to a text format,
.asc.
- asc2db - converts a .asc text file to .db
format.
Note that all command line tools expect the names of the database
files passed to them without the extension. The
.db extension is appended automatically.
There is a tutorial for the asc2db and db2asc tools
in the file 0ASC_TUTORIAL.TXT. These utilities are useful if
you want to convert data to the .db format.
The Java packages
All classes in the Java packages laur.dm.ar,
laur.rand, and laur.tools contain documentation
comments, so you can use javadoc to generate the package
documentation.
I will present here an overview of the classes from the
laur.dm.ar package, which is the package containing the
association rule mining algorithms:
- Itemset and AssociationRule encapsulate the
respective concepts. They implement CriteriaComparable so that
they can be sorted on various criteria.
- CriteriaSorter is a class that sorts
CriteriaComparable objects.
- DBReader and DBWriter are the classes for
reading from and writing to a .db file. Inside their sources you
can find a description of the .db binary file
format. DBException is the exception thrown by these classes.
- DBCacheReader and DBCacheWriter are the classes
for reading from and writing to a .cache file. These files
contain serialized Itemset objects.
- HashTree and SET are two data structures used by
the mining algorithms. SET is a prefix tree and is used by
the algorithms implementing the second mining step to store and
retrieve the frequent itemsets found in the first mining
step. SETException is the exception that is thrown by
SET methods.
- SyntheticDataGenerator does what its name says and is
built using the classes from laur.rand.
- FrequentItemsetsMiner and AssociationsMiner are
abstract classes that are extended by all first mining step algorithms
and, respectively, second mining step algorithms. They extend
laur.tools.AbortableThread.
- Apriori, Closure, ClosureOpt, and
FPgrowth implement the frequent itemset mining algorithms
with the same names. ClosureOpt is a slightly optimized
version of Closure.
- AprioriRules, CoverRules, and
CoverRulesOpt implement the association rule mining
algorithms with the same names. CoverRulesOpt is an optimized
version of CoverRules.
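As an illustration of the prefix tree idea mentioned for SET in the list above, here is a minimal sketch that stores itemsets as paths of item identifiers and keeps a support value at the node where each itemset ends. The class ItemsetTrie and its methods are hypothetical; the real SET class may differ in every detail, so consult its documentation comments for the actual interface.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical prefix-tree sketch; ARtool's SET class may differ in every detail.
class ItemsetTrie {
    private static final class Node {
        final Map<Integer, Node> children = new TreeMap<>();
        double support = -1.0; // -1 marks "no itemset ends here"
    }

    private final Node root = new Node();

    // Items must be given in ascending order so each itemset has one path.
    void insert(int[] sortedItems, double support) {
        Node node = root;
        for (int item : sortedItems) {
            node = node.children.computeIfAbsent(item, k -> new Node());
        }
        node.support = support;
    }

    // Returns the stored support, or -1 if the itemset was never inserted.
    double supportOf(int[] sortedItems) {
        Node node = root;
        for (int item : sortedItems) {
            node = node.children.get(item);
            if (node == null) {
                return -1.0;
            }
        }
        return node.support;
    }

    public static void main(String[] args) {
        ItemsetTrie trie = new ItemsetTrie();
        trie.insert(new int[] {1}, 0.6);
        trie.insert(new int[] {1, 2}, 0.3);
        System.out.println(trie.supportOf(new int[] {1, 2}));
        System.out.println(trie.supportOf(new int[] {2}));
    }
}
```

Itemsets that share a prefix share nodes, so looking up the support of an antecedent costs one step per item, which suits the repeated lookups of the second mining step.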