GAClust User Manual
Working with GAClust involves two steps:
- First, select an existing database or generate a synthetic
database.
- Then, start the clustering process.
Selecting an existing database
To select an existing database use the Database section. In
this section you can type the name of the database directly in the
text field or you can use the Browse button to navigate
through the directories. If the database is not in the current
directory (the directory from which you launched the GAClust
application), then you will have to type the full name of the database,
including the path to the database file, or use the Browse
button.
Next, you will have to specify the number of rows and the number of
attributes to be considered from the database. If you selected a
database containing 1000 rows and 20 attributes, but you specified
only 100 rows and 10 attributes, then GAClust will work only with the
first 10 attributes and the first 100 rows of the selected database,
in the order in which they appear in the database. If you specify a
number of rows or attributes greater than what is actually available,
you will see a warning message in the log area at the bottom of the
application and GAClust will work with the total number of rows and
columns available in the selected database.
Clicking on the Open button will open the database, and information
about the attributes and their values is displayed in a table. The
first column contains the attribute number and the second column
contains the classes of the partition determined by the attribute and,
in parentheses, their cardinalities. For example, opening the
zoo.data database from the UCI Machine Learning Repository,
asking for 101 rows and 17 attributes will generate the following
table:
Attribute 0 | 2(58) 1(43) |
Attribute 1 | 2(20) 1(81) |
Attribute 2 | 2(59) 1(42) |
Attribute 3 | 2(60) 1(41) |
Attribute 4 | 2(24) 1(77) |
Attribute 5 | 2(36) 1(65) |
Attribute 6 | 2(45) 1(56) |
Attribute 7 | 2(40) 1(61) |
Attribute 8 | 2(18) 1(83) |
Attribute 9 | 2(21) 1(80) |
Attribute 10 | 2(8) 1(93) |
Attribute 11 | 2(17) 1(84) |
Attribute 12 | 6(1) 5(2) 4(10) 3(27) 2(23) 1(38) |
Attribute 13 | 2(75) 1(26) |
Attribute 14 | 2(13) 1(88) |
Attribute 15 | 2(57) 1(44) |
Attribute 16 | 7(5) 6(4) 5(8) 4(10) 3(20) 2(13) 1(41) |
The first attribute in the database, named Attribute 0, partitions the
set of rows into 2 classes. Class 1 has 43 rows and class 2 has 58
rows. The last attribute, Attribute 16, corresponds to the classes of
animals documented by this database and it has 7 classes: class 1
contains 41 mammals, class 2 contains 13 fish, class 3 contains 20
birds, class 4 contains 10 invertebrates, class 5 contains 8 insects,
class 6 contains 4 amphibians, and class 7 contains 5 reptiles.
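The cardinalities displayed in the table are simply the number of rows
taking each value of the attribute. As a rough illustration only (this
is not GAClust's code, and the column values below are made up), the
counts for one attribute column could be obtained in Python as follows:

from collections import Counter

# Hypothetical attribute column: the value of one attribute for each row.
attribute_column = [2, 1, 2, 2, 1, 1, 2, 1, 2, 1]

# Each distinct value is one class of the partition induced by the attribute;
# the counts are the cardinalities shown in parentheses in the table.
cardinalities = Counter(attribute_column)
print(" ".join(f"{cls}({count})"
               for cls, count in sorted(cardinalities.items(), reverse=True)))
# Prints: 2(5) 1(5)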
For a database with attributes having a large number of classes in
their partition, the information about the classes and their
cardinalities might not fit entirely in the display area of the table
column. For these attributes, you can double-click on the column and
the full information will be displayed in the log area at the bottom
of the application.
Generate a synthetic database
To generate a database synthetically, use the Generate
section. In this section you will have to type a name for the database
to be generated or check the Use default naming check
box. The other parameters that should be specified are the number of
rows, the number of attributes, and the number of classes in the
attribute partitions. All the attribute partitions will have the same
number of classes. If desired, these partitions can be generated to
have a majority class, that is, a class with more elements than the
other classes. For this situation you will have to check the
Generate a large class check box.
When using default naming, the name of the database will have the
form DBr100a5k2s123, for example, for a database with 100 rows,
5 attributes, and 2 classes in the attribute partitions, generated with
seed 123 for the random number generator. If the database was generated
to have a majority class, the name will have the form
DBr100a5k2s123L for the same values as in the previous example.
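As a minimal sketch of this naming convention (illustrative only; the
variable names are not part of GAClust):

rows, attributes, classes, seed = 100, 5, 2, 123
majority_class = False  # True when "Generate a large class" is checked

name = f"DBr{rows}a{attributes}k{classes}s{seed}"
if majority_class:
    name += "L"    # suffix added when a majority class was requested
print(name)        # DBr100a5k2s123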
Clicking on the Generate button will start the generation of
the database. The partitions associated with the attributes of this
database are generated as variations of a certain partition denoted as
reference partition. The reference partition is generated
randomly and is saved in the last attribute of the database. Given
that N is the number of rows and k is the number of
classes, the generation of the attribute partitions follows the
pattern: for each row number rowid in [1, N], a
number i in [1, k] is randomly generated and saved
at position rowid in the reference partition and in all
attribute partitions except one. The exception attribute, randomly
chosen, receives at position rowid a different value
j in [1, k], with j different from
i. To ensure that the reference and attribute partitions have
exactly k classes, the first values generated for i are 1,
2, ..., k.
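The following Python sketch mirrors the generation pattern described
above. It is only an illustration of the idea (the function name, the
use of Python's random module, and the in-memory layout are
assumptions, not GAClust's actual implementation):

import random

def generate_database(N, num_attrs, k, seed=123):
    """Sketch of the synthetic generation pattern (assumes k >= 2)."""
    random.seed(seed)
    rows = []
    for rowid in range(N):
        # The first k values for i are 1, 2, ..., k so that the reference
        # partition has exactly k classes; afterwards i is random in [1, k].
        i = rowid + 1 if rowid < k else random.randint(1, k)
        # Store i in all attribute partitions and in the reference partition,
        # which is kept as the last attribute of the database.
        row = [i] * (num_attrs + 1)
        # One randomly chosen exception attribute receives a value j != i.
        exception = random.randrange(num_attrs)
        j = random.randint(1, k)
        while j == i:
            j = random.randint(1, k)
        row[exception] = j
        rows.append(row)
    return rows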
After the database is generated you can see the resulting database in
a table with two columns. The first column displays the name of the
attribute and the second column displays the classes of the partition
associated with the attribute and, in parentheses, their
cardinalities. If the information in the second column does not fit
entirely in the table column, double-click on it and the full
information will be displayed in the log area at the bottom of the
application.
Once the generation process has been started by clicking on the
Generate button, you can stop it by clicking on the
Abort button; the database will be truncated to the number
of rows generated so far. Next to the Abort
button you can see a progress bar indicating the progress of the
generation process.
Clustering
Selecting the clustering parameters
The parameters that you should specify for the clustering algorithm
are the following:
- The population size represents the number of chromosomes
in the population of the genetic algorithm.
- The number of classes is the maximal number of classes in
the clustering searched by the genetic algorithm. For some
combinations of crossover and mutation methods the clustering algorithm
might converge towards a partition with fewer classes than the value
specified here.
- The crossover rate is a fractional number in (0, 1),
representing the fraction of the population that will be used to
generate new offspring by crossover.
- The mutation rate is a fractional number in (0, 1),
representing the fraction of the population that will undergo
mutations. For example, if the population size is 100 chromosomes, the
crossover rate is 0.8, and the mutation rate is 0.1, then 80
chromosomes will be used for crossover and 10 chromosomes will be
mutated. The sum of the crossover and mutation rates should be
less than 1 because GAClust copies the best (1 - crossover rate -
mutation rate) x (population size) chromosomes directly to the new
population.
- The fitness threshold has the following significance. If
the objective is to minimize the fitness values and the fitness of the
best chromosome in the current population drops below the fitness
threshold, then the clustering process stops and produces this
chromosome as the final clustering. For maximization of the fitness
values the clustering process stops when the fitness value is greater
than the fitness threshold.
- The consecutive iterations represent the number of
consecutive iterations with no improvement in the fitness of the best
chromosome, after which the clustering process stops.
- The random seed is a number used as a seed for the random
number generator.
- The crossover type selects the crossover method. The
choices are:
- Random This method works by randomly selecting a number
r between 1 and N (the total number of rows) and by exchanging
positions r to N between the two chromosomes (a code sketch of this
crossover appears after this list of parameters). For example, if we
are clustering 10 rows into 2 classes and the two chromosomes are:
       r = 4
       |
       V
C1: 1122122111
C2: 1111222122
and the selected value is r = 4, then the final chromosomes will be:
C1: 1121222122
C2: 1112122111
- X random class This method works by randomly generating
a class number and exchanging this class between the two parent
chromosomes. The elements from one chromosome that belong to the class
to be exchanged and do not belong to this class in the other
chromosome will be assigned to a different class, randomly
generated. Considering the same situation as in the previous case, if
the two chromosomes are:
C1: 1122122111   (elements 5, 9, 10 belong to class 1 only in C1; they will be assigned to class 2)
C2: 1111222122   (elements 3, 4 belong to class 1 only in C2; they will be assigned to class 2)
and the class to exchange is 1, then the resulting chromosomes are:
C1: 1111222122
C2: 1122222111
For this example, the elements that need to be reassigned are moved to
class 2 because they cannot remain in class 1 and we have only two
classes; in general, the class to which they are moved is generated
randomly, and the selection is repeated until a class different from
the class to exchange is obtained.
- X best class This method works by selecting the best class
in the partition determined by each of the two chromosomes and by
exchanging these two classes between the chromosomes. To quantify the
goodness of a class from the clustering associated with a chromosome we
use the notion of class fitness described in the section
Definitions. This method uses a unique encoding of the partition,
denoted list of class rowids, that is, a list of the rowids
belonging to each class, where the order in which these lists appear is
given by the increasing order of the first rowid belonging to each
class.
For example, if the two chromosomes are:
C1: 1121222122
C2: 1112122111
Then, their list of class rowids will be:
C1: [1 2 4 8][3 5 6 7 9 10]
C2: [1 2 3 5 8 9 10][4 6 7]
because for the first chromosome the rowids 1, 2, 4 and 8 belong to
class 1 and the remaining rowids belong to class 2.
If the best class of C1 is 1 and the best class of
C2 is 2, then class [1 2 4 8] from the first chromosome is
copied into the second chromosome, and the elements 3, 5, 9, 10,
previously belonging to class 1 in the second chromosome will be
randomly distributed to other classes.
The final offspring, after the replacement of classes and the
distribution of elements to other classes, might be:
C1: [1 2 8 3 5][4 6 7 9 10]
C2: [1 2 4 8 10][3 5 6 7 9]
- X best worst class This method works by exchanging the
best and worst class between two chromosomes. The best class from the
first chromosome replaces the worst class from the second chromosome
and vice versa.
- X random class (LR) This method works by exchanging a
random class between two chromosomes. It is similar to the method
X random class with the difference that it uses the list of
class rowids representation.
- The mutation type selects the mutation method. The
choices are:
- random This method randomly selects 10% of the number of
rows to be assigned to another, randomly selected class (see also the
code sketch after this list of parameters).
For example if the chromosome to undergo mutation is:
     moved to class 1
     |
     V
C: 1121222122
one element will be moved, and if the selected rowid is 3, then after
mutation the chromosome might be:
C: 1111222122
- move elements This method works by moving 10% of the
number of rows from a randomly selected class into another randomly
selected class. This method uses the list of class rowids
representation.
For example, if the chromosome to undergo mutation is:
             moved to class 1
             |
             V
C: [1 2 4 8][3 5 6 7 9 10]
one element will be moved, and if the selected value is 3 then after
mutation the chromosome might be:
C: [1 2 3 4 8][5 6 7 9 10]
- swap elements This method works by swapping 10% of the
number of rows between two randomly selected classes. This method
uses the list of class rowids representation of the partition
associated with the chromosome.
For example, if the chromosome to undergo mutation is:
C: [1 2 4 8][3 5 6 7 9 10]
one element will be swapped between classes 1 and 2, and if the
selected values are 4 from class 1 and 6 from class 2, then after
mutation the chromosome might be:
C: [1 2 6 8][3 4 5 7 9 10]
- The distribution type selects the method of distributing
the necessary elements for the crossover methods that use the list of
class rowids representation (X best class, X best worst class, X
random class (LR)). The choices are:
- no distribution This should be the selection for the
crossover methods that do not use the list of class rowids
representation (random, X random class).
- random This could be the selection for the crossover
methods that use the list of class rowids representation (X best
class, X best worst class, X random class (LR)); the elements
that need to be redistributed to other classes are distributed
randomly.
- distribute to even This could be the selection for the
crossover methods that use the list of class rowids representation
(X best class, X best worst class, X random class (LR));
the elements that need to be redistributed to other classes are
distributed to classes probabilistically, where classes with
smaller cardinalities have a greater chance of being selected to
receive these elements.
- The class fitness type selects the method of computing the
fitness of classes in the partition associated with the
chromosome. The choices are:
- symmetrical difference
- product
- division
- power
See the Definitions section for
more details about these measures.
- The entropy measure selects the type of generator used in
the computation of the generalized entropy and conditional
entropy. The choices are:
- H(attribute|clustering)
- H(clustering|attribute)
- Both
- Both scaled
- Normalizing weights
- Exclude attributes
- Alternate Havg
- Alternate
- Module
- Q
- L
- QR
- LR
- Q+QR
See the Definitions section for
more details about these measures. For the measures Normalizing
weights and Exclude attributes you have to use a target
and specify the target id (a number between 0 and one less than the
total number of attributes), and you also have to use weights and
specify the percent of the original database to be selected as a sample
database. For the measure Exclude attributes you also have to
select the exclusion of attributes and specify the number of
attributes to be considered.
- The optimization type selects between minimization and
maximization of the chromosomes' fitness values. For the fitness
measures Q, L, QR, LR, Q+QR the optimization method should be
maximization; otherwise it should be minimization.
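To make the chromosome operations described above more concrete, here
is a short Python sketch of the random (one-point) crossover, the
random mutation, and the conversion to the list of class rowids
representation. It is an illustration under assumed names, not
GAClust's implementation:

import random

def random_crossover(c1, c2):
    """Random crossover: exchange positions r to N between the two parents."""
    N = len(c1)
    r = random.randint(1, N)              # crossover point (1-based)
    return c1[:r - 1] + c2[r - 1:], c2[:r - 1] + c1[r - 1:]

def random_mutation(c, k, fraction=0.1):
    """Reassign about 10% of the rows to another randomly selected class
    (assumes k >= 2)."""
    c = list(c)
    for rowid in random.sample(range(len(c)), max(1, int(fraction * len(c)))):
        new_class = random.randint(1, k)
        while new_class == c[rowid]:
            new_class = random.randint(1, k)
        c[rowid] = new_class
    return c

def list_of_class_rowids(c):
    """Label string -> list of class rowids, ordered by the first rowid
    belonging to each class."""
    classes = {}
    for rowid, cls in enumerate(c, start=1):
        classes.setdefault(cls, []).append(rowid)
    return sorted(classes.values(), key=lambda rowids: rowids[0])

# The chromosomes used in the crossover examples above (10 rows, 2 classes):
c1 = [1, 1, 2, 2, 1, 2, 2, 1, 1, 1]
c2 = [1, 1, 1, 1, 2, 2, 2, 1, 2, 2]
# The offspring depend on the randomly drawn r; with r = 4 they match the
# random crossover example given above.
print(random_crossover(c1, c2))
print(list_of_class_rowids([1, 1, 2, 1, 2, 2, 2, 1, 2, 2]))
# Prints: [[1, 2, 4, 8], [3, 5, 6, 7, 9, 10]]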
Clustering results
The clustering results are presented in a table in the GA tab. Depending
on whether a reference partition exists (for synthetic databases) or a
target attribute was used, the results are presented in one of two
forms.
- We did not use a target attribute. In this situation the following
information is presented:
- H(clustering) = the entropy of the clustering partition
- The fitness of the clustering partition
- The sum of the quantity H(clustering|attribute) +
H(attribute|clustering) computed for all attributes in the
database. This measure gives an indication of how similar the
clustering partition is to the attribute partitions (a sketch of this
computation appears after this list of results)
- Classes and their cardinalities (specified in parentheses) for the
clustering partition
- Number of iterations performed by the genetic algorithm
- Time in seconds
- We used a target attribute. In this situation the following
information is presented:
- H(reference) = the entropy of the reference partition (first column)
- H(clustering) = the entropy of the clustering partition (second column)
- The fitness of the reference partition (first column)
- The fitness of the clustering partition (second column)
- Classes and their cardinalities (specified in parentheses) for the
reference partition (first column)
- Classes and their cardinalities (specified in parentheses) for the
clustering partition (second column)
- The sum of the quantity H(reference|attribute) +
H(attribute|reference) computed for all attributes in the
database. This measure gives an indication of how similar the
reference partition is to the attribute partitions (first column)
- The sum of the quantity H(clustering|attribute) +
H(attribute|clustering) computed for all attributes in the
database. This measure gives an indication of how similar the
clustering partition is to the attribute partitions (second column)
- Number of iterations performed by the genetic algorithm (first column)
- Time in seconds (second column)
- Class of the reference partition (first column)
- How this class intersects the classes from the clustering partition: class from clustering (its cardinality) [intersection count] (second column)
- Classification rate (second column)
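As an illustration of the similarity quantity reported in the results,
the sketch below computes H(clustering|attribute) +
H(attribute|clustering) for a single attribute using Shannon entropy.
GAClust actually uses a generalized entropy whose generator is selected
by the Entropy measure parameter, so this is only a rough sketch under
that simplifying assumption:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(X) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(x, y):
    """H(X|Y): entropy of x restricted to each class of y, weighted by class size."""
    n = len(x)
    blocks = {}
    for xi, yi in zip(x, y):
        blocks.setdefault(yi, []).append(xi)
    return sum(len(block) / n * entropy(block) for block in blocks.values())

def similarity_term(clustering, attribute):
    """H(clustering|attribute) + H(attribute|clustering); smaller values mean
    the two partitions are more similar."""
    return (conditional_entropy(clustering, attribute)
            + conditional_entropy(attribute, clustering))

# Summing similarity_term(clustering, column) over all attribute columns gives
# the quantity reported in the results table.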
The clustering process can be stopped by clicking on the
Abort button. The nature of the results, whether they are
final or partial, is posted in the log area of the application.
Examples of using GAClust
Below we include several runs of GAClust on the
zoo.data database from the UCI Machine Learning Repository,
which documents 101 animals characterized by 17 attributes (the last
attribute specifying the class of animals).
The parameters for the genetic algorithms were selected as follows:
- Population size = 100
- Number of classes = 7
- Crossover rate = 0.8
- Crossover type = random
- Distribution type = none
- Mutation rate = 0.1
- Mutation type = random
- Class fitness type = symmetrical difference
- Entropy type = GINI
- Fitness threshold = 0.0001
- Consecutive iterations with no improvement = 100
- Random seed = 1
- Optimization type = minimization
- Target id = 16
For the fitness measure Normalizing weights we used a sample
database of 40% of the original database. For the Exclude
attributes fitness measure the number of remaining attributes was
set to 10.
Fitness measure | Classification rate | No. Iterations | Time (sec) |
H(attribute|clustering) | 0.53 | 1157 | 334 |
H(clustering|attribute) | 0.41 | 747 | 186 |
Both | 0.61 | 1276 | 368 |
Both scaled | 0.74 | 900 | 316 |
Normalizing weights | 0.70 | 795 | 222 |
Exclude attributes | 0.41 | 1102 | 190 |
Alternate Havg | 0.55 | 1009 | 291 |
Alternate | 0.51 | 1168 | 330 |
Module | 0.51 | 1168 | 324 |
For the following fitness measures the optimization method was set to
maximization and the fitness threshold to 10000.
Fitness measure | Classification rate | No. Iterations | Time (sec) |
Q | 0.78 | 981 | 279 |
L | 0.70 | 862 | 242 |
QR | 0.78 | 1199 | 354 |
LR | 0.80 | 801 | 261 |
Q+QR | 0.77 | 737 | 239 |
The results can be improved by increasing the number of consecutive
iterations without improvement in the fitness value, although this
will also increase the running time of the algorithm. The user will
have to find a suitable balance between the accuracy of the results
and the time to run the genetic algorithm. Also, due to the randomness
of the process, several runs should be performed with different random
seeds, and the best result among these runs should be considered.