Definitions

GAClust is a Java Application for clustering categorical databases. The process of finding this clustering uses a genetic algorithm approach.

Categorical attributes are attributes with a discrete domain. An example of a categorical attribute is the attribute color, which can have the domain {red, blue, green}. Databases having only categorical attributes are named categorical databases.

Finding a clustering of a categorical database means grouping the set fo rows of the database into a number of groups called classes, according to some criteria. These classes form a partition of the set of rows; each row belongs to only one class and all the rows are asigned to classes.

Each attribute of the database generates a partition of the set of rows of the database. The rows having the same value of the attribute form a class of this partition. For example, the color attribute determines a partition with three classes in which all the rows having the value red for the attribute color belong to a class, all the rows having the value blue belong to another class, and respectively, all the rows having the value green belong to the last class of the partition.

The criterion used in the grouping process is related to the partitions determined by the attributes of the database and quantifies a certain relationship between these partitions. To measure the effectiveness of the clustering found by GAClust, we use the notion of generalied conditional entropy.

The classical notion of entropy and conditional entropy can be generalized using the notion of a generator.

Let tex2html_wrap_inline578 be the set of reals. The k-dimensional simplex is the set:
displaymath562

A function tex2html_wrap_inline582 is concave on a set tex2html_wrap_inline584 if
displaymath563
for tex2html_wrap_inline586 and tex2html_wrap_inline588.

The function f is sub-additive (supra-additive) on R if
displaymath564
for tex2html_wrap_inline588.


 definition544

Functions such as tex2html_wrap_inline604 (the Gini index), tex2html_wrap_inline606 (the Shannon entropy), tex2html_wrap_inline608, or tex2html_wrap_inline610 given by:
displaymath565
are examples of generators.

Since f is concave, it satisfies Jensen's inequality tex2html_wrap_inline614, for every tex2html_wrap_inline616, and this implies that the largest value of the sum tex2html_wrap_inline618 is achieved if and only if tex2html_wrap_inline620.


definition552

Note that Jensen's inequality implies that the largest value of the f-entropy of the partition tex2html_wrap_inline672 is tex2html_wrap_inline674. The conditional f-entropy tex2html_wrap_inline678 is the average value of the specific f-impurity of the classes of the partition tex2html_wrap_inline682 relative to the partition tex2html_wrap_inline672.

A dissimilarity on a set S is a function tex2html_wrap_inline688 such that tex2html_wrap_inline690, d(x,x)=0 and d(x,y)= d(y,x) for every tex2html_wrap_inline696.

Using the generalized entropy and conditional entropy we defined a dissimilarity between two partitions. The dissimilarity d is said to be definite if for all tex2html_wrap_inline696, d(x,y)=0 implies x=y. If a dissimilarity satisfies the triangular inequality tex2html_wrap_inline706 for tex2html_wrap_inline708, then we obtain the familiar notion of metric on S. If f is a generator, then the mapping tex2html_wrap_inline714 given by tex2html_wrap_inline716 for tex2html_wrap_inline718, is clearly a definite dissimilarity on tex2html_wrap_inline720. When tex2html_wrap_inline672 is close to tex2html_wrap_inline682, meaning that their classes have many elements in common, then both tex2html_wrap_inline678 and tex2html_wrap_inline728 are close to 0, so tex2html_wrap_inline730 is close to 0.

To measure the quality of the clustering of a database we designed the following measures that apply to each class of the partition associated with the clustering. Let the partition associated with a chromosome K be tex2html_wrap_inline736 and the partition associated with an attribute tex2html_wrap_inline738 be tex2html_wrap_inline740. We use the following class fitness measures:


tabular277

The smaller the value of the class measure the better is the class with respect to the classes of the attribute partitions. The class measure tex2html_wrap_inline758 favors classes that fit very well inside at least one class of each attribute partition. tex2html_wrap_inline760 favors classes with smaller average impurity with respect to all attribute partitions. tex2html_wrap_inline762 favors greater cardinality classes having also smaller average impurity with respect to all attribute partitions. Finally, tex2html_wrap_inline764 prefers large cardinality classes with small impurity, but the dependence on the impurity is not linear like in the case of tex2html_wrap_inline762, it is exponential.

We propose the quantities summarized in the following table as measures of the closeness between the partitions represented by the chromosomes and the attribute partitions:

tabular308

To assess the quality of the resulting partition, we compute a quantity denoted as classification rate. We identify a one-to-one mapping of the classes from the resulting partition tex2html_wrap_inline672 and the classes from the reference partition tex2html_wrap_inline820. At each step, we maintain a list of classes from tex2html_wrap_inline820 that have not been yet mapped to any class of tex2html_wrap_inline672. Initially, this list contains all the classes from tex2html_wrap_inline820. Then, for each class of tex2html_wrap_inline672 we compute the cardinalities of its intersections with the classes in tex2html_wrap_inline820, given by the number of tuples they have in common. The pairs consisting of a class from tex2html_wrap_inline672 and an available class from tex2html_wrap_inline820, are processed in decreasing order of their intersection's cardinality. Each time we retain and map the pair yielding the largest number of tuples in common. Once a class from tex2html_wrap_inline820 is mapped to a class from tex2html_wrap_inline672, it is removed from the list of available classes. When the unique mapping between partitions has been completed, the classification rate is computed as the fraction of all tuples, represented by the sum of tuples shared by the classes mapped to each other.