GAClust User Manual

Working with GAClust involves two steps:

  1. First, select an existing database or generate a synthetic database.
  2. Then, start the clustering process.

Selecting an existing database

To select an existing database, use the Database section. In this section you can type the name of the database directly in the text field, or you can use the Browse button to navigate through the directories. If the database is not in the current directory (the directory from which you launched the GAClust application), you must either type the full name of the database, including the path to the database file, or use the Browse button.

Next, specify the number of rows and the number of attributes to be considered from the database. If you select a database containing 1000 rows and 20 attributes but specify only 100 rows and 10 attributes, GAClust will work only with the first 10 attributes and the first 100 rows of the selected database, in the order in which they appear in the database. If you specify more rows or attributes than the database actually contains, a warning message appears in the log area at the bottom of the application and GAClust works with all the rows and attributes available in the selected database.
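The row and attribute limits described above can be sketched in Python. This is only an illustration of the truncation and clamping behavior, not GAClust's own code: the function name load_subset is hypothetical, and the comma-separated file layout is an assumption, since this manual does not state the database file format.

```python
import csv
import io
import warnings


def load_subset(source, n_rows, n_attrs, delimiter=","):
    """Return the first n_rows rows and first n_attrs columns of a text table.

    Hypothetical helper: the delimiter-separated layout is an assumption.
    """
    rows = [r for r in csv.reader(source, delimiter=delimiter) if r]
    avail_rows = len(rows)
    avail_attrs = min(len(r) for r in rows) if rows else 0
    if n_rows > avail_rows or n_attrs > avail_attrs:
        # Mirrors the warning behavior described above: fall back to what exists.
        warnings.warn(
            f"requested {n_rows} rows x {n_attrs} attributes, "
            f"only {avail_rows} x {avail_attrs} available"
        )
    n_rows = min(n_rows, avail_rows)
    n_attrs = min(n_attrs, avail_attrs)
    # Keep rows and attributes in the order in which they appear in the source.
    return [r[:n_attrs] for r in rows[:n_rows]]


if __name__ == "__main__":
    sample = io.StringIO("1,0,1\n0,0,1\n1,1,0\n")
    print(load_subset(sample, 2, 2))
```

Clamping to the available size, rather than raising an error, matches the behavior the paragraph describes.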

Clicking on the Open button opens the database and displays information about the attributes and their values in a table. The first column contains the attribute number, and the second column contains the classes of the partition determined by the attribute, with their cardinalities in parentheses. For example, opening the zoo.data database from the UCI Machine Learning Repository and asking for 101 rows and 17 attributes will generate the following table:

Attribute 0   2(58) 1(43)
Attribute 1   2(20) 1(81)
Attribute 2   2(59) 1(42)
Attribute 3   2(60) 1(41)
Attribute 4   2(24) 1(77)
Attribute 5   2(36) 1(65)
Attribute 6   2(45) 1(56)
Attribute 7   2(40) 1(61)
Attribute 8   2(18) 1(83)
Attribute 9   2(21) 1(80)
Attribute 10  2(8) 1(93)
Attribute 11  2(17) 1(84)
Attribute 12  6(1) 5(2) 4(10) 3(27) 2(23) 1(38)
Attribute 13  2(75) 1(26)
Attribute 14  2(13) 1(88)
Attribute 15  2(57) 1(44)
Attribute 16  7(5) 6(4) 5(8) 4(10) 3(20) 2(13) 1(41)

The first attribute in the database, named Attribute 0, partitions the set of rows into 2 classes. Class 1 has 43 rows and class 2 has 58 rows. The last attribute, Attribute 16, corresponds to the classes of animals documented by this database and it has 7 classes: class 1 contains 41 mammals, class 2 contains 13 fish, class 3 contains 20 birds, class 4 contains 10 invertebrates, class 5 contains 8 insects, class 6 contains 4 amphibians, and class 7 contains 5 reptiles.
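As an illustration of how one row of the table above can be derived, the following Python sketch counts how many rows fall into each class of an attribute and formats the result in the same "class(cardinality)" style. The function name partition_summary is hypothetical, and the sketch assumes the column already holds integer class numbers; how GAClust assigns class numbers to raw attribute values is not described here.

```python
from collections import Counter


def partition_summary(column):
    """Format the partition induced by one attribute as 'class(cardinality)' pairs.

    Assumes the column already contains integer class numbers.
    """
    counts = Counter(column)
    # Classes are listed in descending order, matching the table above.
    return " ".join(f"{label}({counts[label]})" for label in sorted(counts, reverse=True))


if __name__ == "__main__":
    # A column with 43 rows in class 1 and 58 rows in class 2, like Attribute 0.
    attribute_0 = [1] * 43 + [2] * 58
    print(partition_summary(attribute_0))  # prints "2(58) 1(43)"
```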

For a database whose attributes have a large number of classes in their partitions, the information about the classes and their cardinalities might not fit entirely in the display area of the table column. For these attributes, you can double-click on the column and the full information will be displayed in the log area at the bottom of the application.

Generating a synthetic database

To generate a database synthetically, use the Generate section. In this section you must either type a name for the database to be generated or check the Use default naming check box. The other parameters to specify are the number of rows, the number of attributes, and the number of classes in the attribute partitions. All the attribute partitions will have the same number of classes. If desired, these partitions can be generated with a majority class, that is, a class with more elements than the other classes. To do this, check the Generate a large class check box.

With default naming, the name of the database has the form DBr100a5k2s123, for example, for a database with 100 rows, 5 attributes, and 2 classes in the attribute partitions, generated with the seed 123 for the random number generator. If the database was generated with a majority class, the name has the form DBr100a5k2s123L for the same values as in the previous example.
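The naming pattern can be expressed as a small Python function. This is a sketch inferred from the two example names above; the function name default_db_name and its parameter names are hypothetical.

```python
def default_db_name(rows, attrs, classes, seed, large_class=False):
    """Build a default database name following the pattern DBr<rows>a<attrs>k<classes>s<seed>[L]."""
    name = f"DBr{rows}a{attrs}k{classes}s{seed}"
    # A trailing "L" marks a database generated with a majority (large) class.
    return name + "L" if large_class else name


if __name__ == "__main__":
    print(default_db_name(100, 5, 2, 123))                    # prints "DBr100a5k2s123"
    print(default_db_name(100, 5, 2, 123, large_class=True))  # prints "DBr100a5k2s123L"
```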

Clicking on the Generate button starts the generation of the database. The partitions associated with the attributes of this database are generated as variations of a single partition, called the reference partition. The reference partition is generated randomly and is saved in the last attribute of the database. Given that N is the number of rows and k is the number of classes, the generation of the attribute partitions follows this pattern: for each row number rowid in [1, N], a number i in [1, k] is randomly generated and saved at position rowid in the reference partition and in all attribute partitions but one. The exception attribute, chosen at random, receives at position rowid a different value j in [1, k], with j different from i. To ensure that the reference and attribute partitions have exactly k classes, the first values for i are 1, 2, ..., k.
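The generation pattern described above can be sketched in Python as follows. This is a minimal illustration, not GAClust's implementation. It makes several assumptions: the function name generate_database is hypothetical, the reference partition is stored as an extra final column after the n_attrs attribute columns, a new exception attribute is chosen for every row, and the majority-class option is not modeled because its procedure is not described here.

```python
import random


def generate_database(n_rows, n_attrs, k, seed):
    """Generate rows whose attributes are noisy copies of a random reference partition.

    Each returned row holds n_attrs attribute values followed by the reference
    value (assumed layout: reference partition in the last column).
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that a value j != i exists")
    rng = random.Random(seed)
    table = []
    for rowid in range(1, n_rows + 1):
        # The first k rows take i = 1, 2, ..., k so that every class occurs.
        i = rowid if rowid <= k else rng.randint(1, k)
        # One randomly chosen attribute receives a different class j != i.
        exception = rng.randrange(n_attrs)
        j = rng.choice([c for c in range(1, k + 1) if c != i])
        row = [i] * n_attrs
        row[exception] = j
        row.append(i)  # reference partition value
        table.append(row)
    return table


if __name__ == "__main__":
    for row in generate_database(n_rows=6, n_attrs=4, k=3, seed=123):
        print(row)
```

Using a dedicated random.Random(seed) instance keeps the output reproducible for a given seed, which fits the role the seed plays in the default database name.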

After the database is generated, you can see the resulting database in a table with two columns. The first column displays the name of the attribute, and the second column displays the classes of the partition associated with the attribute, with their cardinalities in parentheses. If the information in the second column does not fit entirely in the table column, double-click on it and the full information will be displayed in the log area at the bottom of the application.

Once the generation process has been started by clicking on the Generate button, you can stop it by clicking on the Abort button; the database will then be truncated to the number of rows generated so far. Next to the Abort button, a progress bar indicates the progress of the generation process.

Clustering

Selecting the clustering parameters

The parameters that you should specify for the clustering algorithm are the following:

Clustering results

The clustering results are presented in a table in the GA tab. The results take one of two forms, depending on whether a reference partition (for synthetic databases) or a target attribute is available. The clustering process can be stopped by clicking on the Abort button. The log area at the right of the application reports whether the results are final or partial.

Examples of using GAClust

Below are several runs of GAClust on the zoo.data database from the UCI Machine Learning Repository, which documents 101 animals characterized by 17 attributes (the last attribute specifies the class of the animal). The parameters for the genetic algorithms were selected as follows: For the Normalizing weights fitness measure we used a sample database of 40% of the original database. For the Exclude attributes fitness measure the number of remaining attributes was set to 10.

Fitness measure          Classification rate  No. iterations  Time (sec)
H(attribute|clustering)  0.53                 1157            334
H(clustering|attribute)  0.41                 747             186
Both                     0.61                 1276            368
Both scaled              0.74                 900             316
Normalizing weights      0.70                 795             222
Exclude attributes       0.41                 1102            190
Alternate Havg           0.55                 1009            291
Alternate                0.51                 1168            330
Module                   0.51                 1168            324

For the following fitness measures the optimization method was set to maximization and the fitness threshold to 10000.

Fitness measure  Classification rate  No. iterations  Time (sec)
Q                0.78                 981             279
L                0.70                 862             242
QR               0.78                 1199            354
LR               0.80                 801             261
Q+QR             0.77                 737             239

The results can be improved by increasing the number of consecutive iterations allowed without improvement in the fitness value. This also increases the running time of the algorithm, so the user has to find a suitable balance between the accuracy of the results and the time needed to run the genetic algorithms. In addition, because the process is random, several runs should be made with different random seeds, and the best result among these runs should be kept.
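The advice about repeating runs with different seeds can be sketched in Python. GAClust is operated through its user interface, so everything here is hypothetical: best_of_runs is an illustrative helper, and run stands for any callable that performs one clustering run for a given seed and returns its classification rate. The fake_run function is a stand-in used only to make the sketch executable.

```python
import random


def best_of_runs(run, seeds):
    """Call run(seed) for each seed and return (best_score, best_seed).

    'run' is a hypothetical callable returning a score where higher is better,
    for example a classification rate.
    """
    results = [(run(seed), seed) for seed in seeds]
    # Keep the run with the highest score; the first one wins on ties.
    return max(results, key=lambda result: result[0])


def fake_run(seed):
    """Stand-in for a real clustering run: returns a pseudo-random score in [0, 1)."""
    return random.Random(seed).random()


if __name__ == "__main__":
    score, seed = best_of_runs(fake_run, seeds=range(1, 11))
    print(f"best score {score:.2f} obtained with seed {seed}")
```

Recording the seed together with the score makes the best run reproducible later.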