GAClust is a Java application for clustering categorical databases. It finds a clustering using a genetic algorithm approach.
Categorical attributes are attributes with a discrete domain. An example of a categorical attribute is the attribute color, which can have the domain {red, blue, green}. Databases whose attributes are all categorical are called categorical databases.
Finding a clustering of a categorical database means grouping the set of rows of the database into a number of groups called classes, according to some criteria. These classes form a partition of the set of rows: each row belongs to exactly one class, and every row is assigned to a class.
Each attribute of the database generates a partition of the set of rows of the database: the rows having the same value of the attribute form a class of this partition. For example, the color attribute determines a partition with three classes, in which all the rows having the value red for the attribute color belong to one class, all the rows having the value blue belong to another class, and all the rows having the value green belong to the last class.
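As a small illustration (the class and method names here are ours, not GAClust's actual API), the partition induced by one attribute can be computed by grouping row indices by their value in that attribute's column:

```java
import java.util.*;

public class AttributePartition {
    // Groups row indices by the value of one attribute (column),
    // producing the partition induced by that attribute.
    static Collection<List<Integer>> partitionByAttribute(String[][] rows, int column) {
        Map<String, List<Integer>> classes = new LinkedHashMap<>();
        for (int i = 0; i < rows.length; i++) {
            classes.computeIfAbsent(rows[i][column], v -> new ArrayList<>()).add(i);
        }
        return classes.values();
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"red", "small"}, {"blue", "small"}, {"red", "large"}, {"green", "small"}
        };
        // The color attribute (column 0) induces three classes: {0, 2}, {1}, {3}.
        System.out.println(partitionByAttribute(rows, 0));
    }
}
```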
The criterion used in the grouping process is related to the partitions determined by the attributes of the database and quantifies a certain relationship between these partitions. To measure the effectiveness of the clustering found by GAClust, we use the notion of generalized conditional entropy.
The classical notions of entropy and conditional entropy can be generalized using the notion of a generator.
Let $\mathbb{R}$ be the set of reals. The $k$-dimensional simplex $\Delta_k$ is the set $\Delta_k = \{(p_1, \ldots, p_k) \in \mathbb{R}^k \mid p_i \geq 0 \text{ for } 1 \leq i \leq k \text{ and } \sum_{i=1}^{k} p_i = 1\}$. A function $f : [0,1] \longrightarrow \mathbb{R}$ is concave on the set $[0,1]$ if $f(\alpha x + (1-\alpha) y) \geq \alpha f(x) + (1-\alpha) f(y)$ for $x, y \in [0,1]$ and $\alpha \in [0,1]$.
The function $f$ is sub-additive (supra-additive) on $\mathbb{R}$ if $f(x+y) \leq f(x) + f(y)$ (respectively, $f(x+y) \geq f(x) + f(y)$) for $x, y \in \mathbb{R}$.
Functions such as $f(x) = x(1-x)$ (the Gini index) and $f(x) = -x \log_2 x$ (the Shannon entropy) are examples of generators.
Since $f$ is concave, it satisfies Jensen's inequality $\frac{1}{k} \sum_{i=1}^{k} f(p_i) \leq f\!\left(\frac{1}{k} \sum_{i=1}^{k} p_i\right)$ for every $(p_1, \ldots, p_k) \in \Delta_k$, and this implies that the largest value of the sum $\sum_{i=1}^{k} f(p_i)$ is achieved if and only if $p_1 = \cdots = p_k = \frac{1}{k}$. Note that Jensen's inequality implies that the largest value of the $f$-entropy of a partition with $k$ classes is $k \, f\!\left(\frac{1}{k}\right)$.
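As an illustration (a minimal sketch, not GAClust's code), the $f$-entropy of a partition under the Gini generator $f(x) = x(1-x)$ sums $f$ over the relative sizes of the classes; by the inequality above, it is maximized by the uniform partition:

```java
public class FEntropy {
    // Gini generator f(x) = x(1 - x).
    static double f(double x) { return x * (1 - x); }

    // f-entropy of a partition given by its class sizes: sum of f(|B_i| / n).
    static double entropy(int[] classSizes) {
        int n = 0;
        for (int s : classSizes) n += s;
        double h = 0;
        for (int s : classSizes) h += f((double) s / n);
        return h;
    }

    public static void main(String[] args) {
        // Uniform partition of 4 rows into 2 classes: 2 * f(1/2) = 0.5,
        // the maximum k * f(1/k) for k = 2.
        System.out.println(entropy(new int[]{2, 2}));
        // A skewed partition has strictly smaller f-entropy.
        System.out.println(entropy(new int[]{3, 1}));
    }
}
```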
The conditional $f$-entropy $\mathcal{H}_f(\pi \mid \sigma)$ is the average value of the specific $f$-impurity of the classes of the partition $\sigma$ relative to the partition $\pi$.
A dissimilarity on a set $S$ is a function $d : S \times S \longrightarrow \mathbb{R}_{\geq 0}$ such that $d(x,y) \geq 0$, $d(x,x) = 0$, and $d(x,y) = d(y,x)$ for every $x, y \in S$.
Using the generalized entropy and conditional entropy we defined a dissimilarity between two partitions. The dissimilarity $d$ is said to be definite if for all $x, y \in S$, $d(x,y) = 0$ implies $x = y$. If a dissimilarity satisfies the triangular inequality $d(x,z) \leq d(x,y) + d(y,z)$ for $x, y, z \in S$, then we obtain the familiar notion of a metric on $S$. If $f$ is a generator, then the mapping $d_f$ given by $d_f(\pi, \sigma) = \mathcal{H}_f(\pi \mid \sigma) + \mathcal{H}_f(\sigma \mid \pi)$ for partitions $\pi, \sigma$ of $S$, is clearly a definite dissimilarity on the set of partitions of $S$. When $\pi$ is close to $\sigma$, meaning that their classes have many elements in common, then both $\mathcal{H}_f(\pi \mid \sigma)$ and $\mathcal{H}_f(\sigma \mid \pi)$ are close to 0, so $d_f(\pi, \sigma)$ is close to 0.
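This partition dissimilarity can be sketched as follows (an illustration only, using the common definition of conditional $f$-entropy as the size-weighted average of the $f$-impurity of one partition's classes restricted to the other's; the names are ours, not GAClust's):

```java
import java.util.*;

public class PartitionDissimilarity {
    // Gini generator f(x) = x(1 - x).
    static double f(double x) { return x * (1 - x); }

    // Conditional f-entropy H_f(pi | sigma): the size-weighted average, over
    // the classes C of sigma, of the f-impurity of pi restricted to C.
    // Partitions are given as per-row class labels.
    static double conditionalEntropy(int[] pi, int[] sigma) {
        int n = pi.length;
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>(); // sigma class -> pi class -> count
        Map<Integer, Integer> sigmaSizes = new HashMap<>();
        for (int t = 0; t < n; t++) {
            counts.computeIfAbsent(sigma[t], k -> new HashMap<>()).merge(pi[t], 1, Integer::sum);
            sigmaSizes.merge(sigma[t], 1, Integer::sum);
        }
        double h = 0;
        for (Map.Entry<Integer, Map<Integer, Integer>> e : counts.entrySet()) {
            int size = sigmaSizes.get(e.getKey());
            for (int c : e.getValue().values()) h += ((double) size / n) * f((double) c / size);
        }
        return h;
    }

    // The dissimilarity: sum of both conditional entropies; 0 iff the
    // partitions coincide (up to a relabeling of the classes).
    static double dissimilarity(int[] pi, int[] sigma) {
        return conditionalEntropy(pi, sigma) + conditionalEntropy(sigma, pi);
    }

    public static void main(String[] args) {
        int[] pi    = {0, 0, 1, 1};
        int[] same  = {5, 5, 7, 7}; // same partition, different labels
        int[] other = {0, 1, 0, 1};
        System.out.println(dissimilarity(pi, same));  // 0.0
        System.out.println(dissimilarity(pi, other)); // positive
    }
}
```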
To measure the quality of the clustering of a database we designed the following measures, which apply to each class of the partition associated with the clustering. Let the partition associated with a chromosome $K$ be $\kappa$ and the partition associated with an attribute $A$ be $\pi^A$. We use the following class fitness measures:
The smaller the value of a class measure, the better the class is with respect to the classes of the attribute partitions. The first class measure favors classes that fit very well inside at least one class of each attribute partition. The second favors classes with smaller average impurity with respect to all attribute partitions. The third favors classes of greater cardinality that also have smaller average impurity with respect to all attribute partitions. Finally, the fourth prefers large-cardinality classes with small impurity, but its dependence on the impurity is exponential rather than linear as in the third measure.
We propose the quantities summarized in the following table as
measures of the closeness between the partitions represented by the
chromosomes and the attribute partitions:
To assess the quality of the resulting partition, we compute a quantity called the classification rate. We identify a one-to-one mapping between the classes of the resulting partition $\pi$ and the classes of the reference partition $\sigma$. At each step, we maintain a list of classes from $\sigma$ that have not yet been mapped to any class of $\pi$; initially, this list contains all the classes of $\sigma$. Then, for each class of $\pi$ we compute the cardinalities of its intersections with the classes in $\sigma$, given by the number of tuples they have in common. The pairs consisting of a class from $\pi$ and an available class from $\sigma$ are processed in decreasing order of their intersection's cardinality. Each time we retain and map the pair yielding the largest number of tuples in common. Once a class from $\sigma$ is mapped to a class from $\pi$, it is removed from the list of available classes. When the unique mapping between the partitions has been completed, the classification rate is computed as the sum of the tuples shared by the classes mapped to each other, divided by the total number of tuples.
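The greedy matching described above can be sketched like this (an illustrative implementation under our naming, not GAClust's own code): all intersection cardinalities are computed, the pairs are sorted in decreasing order, and a pair is mapped only when both of its classes are still available.

```java
import java.util.*;

public class ClassificationRate {
    // Classification rate between a resulting partition and a reference
    // partition, both given as per-tuple class labels (nonnegative ints).
    static double classificationRate(int[] result, int[] reference) {
        int n = result.length;
        // Intersection cardinality for every (result class, reference class)
        // pair, packed into one long key: high 32 bits = result class.
        Map<Long, Integer> inter = new HashMap<>();
        for (int t = 0; t < n; t++) {
            inter.merge(((long) result[t] << 32) | (reference[t] & 0xffffffffL), 1, Integer::sum);
        }
        // Process pairs in decreasing order of intersection cardinality.
        List<Map.Entry<Long, Integer>> pairs = new ArrayList<>(inter.entrySet());
        pairs.sort((a, b) -> b.getValue() - a.getValue());
        Set<Integer> usedResult = new HashSet<>(), usedReference = new HashSet<>();
        int matched = 0;
        for (Map.Entry<Long, Integer> p : pairs) {
            int rc = (int) (p.getKey() >> 32), fc = (int) (long) p.getKey();
            // Map the pair only if both classes are still unmapped.
            if (!usedResult.contains(rc) && !usedReference.contains(fc)) {
                usedResult.add(rc);
                usedReference.add(fc);
                matched += p.getValue();
            }
        }
        return (double) matched / n;
    }

    public static void main(String[] args) {
        int[] result    = {0, 0, 1, 1, 1};
        int[] reference = {4, 4, 4, 7, 7};
        // Class 0 maps to class 4 (2 shared tuples), class 1 to class 7
        // (2 shared tuples): rate = 4/5 = 0.8.
        System.out.println(classificationRate(result, reference));
    }
}
```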