DiCLENS: Divisive Clustering Ensemble with Automatic Cluster Number

Selim Mimaroglu, Emin Aksehirli

Department of Computer Engineering, Bahcesehir University, Ciragan Caddesi 34353 Besiktas, Istanbul, Turkey

Abstract

Motivation

Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task as attested by hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in some cases impossible, to satisfy all the assumptions.

Therefore, it is beneficial to apply different clustering methods on the same data set, or the same method with varying input parameters or both. We propose a novel method, DiCLENS, which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms.

Results

Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DiCLENS produces very good quality clusterings in a short amount of time. DiCLENS implementation runs on standard personal computers by being scalable, and by consuming very little memory and CPU.

Contact:

selim DOT mimaroglu AT bahcesehir DOT edu DOT tr

Implementation

Obtain the application

Executable jar file for the application can be downloaded here. This jar file includes all the libraries, thefore you only need to have JRE. DiCLENS implementation should be used as follows.

On Ubuntu:

Right click the jar file and select "Open with OpenJDK Run Time" or "Open with SunJDK Run Time"

On Mac OS, MS Windows etc.

Type the following command:

java -jar diclens.jar

Usage of Application

Select the input file by clicking the Browse File button or manually entering the path of the input file. Click Run Algorithm button. DiCLENS will be run and panels wll be updated according to data set and results as follows:

  • First panel shows the Similarity based Minimum Spanning Tree (SMST).
  • Second panel shows the meta-clusters.
  • Final Clustering is shown at the bottom panel: Venn Diagram of final clusters, their corresponding data points as set notation and in comma separated notation which is same as the input file notation.

Contents of the input clusters can be checked by placing the mouse over each cluster. Edge weights, which are Inter Cluster Similarities, can be checked by the same method.

Screenshot of application

Input data sets

Application gets input clusterings from a comma separated file, where every line is a clustering and every comma separated number indicates the group of the corresponding data object. An example of input file is shown below.

1,1,2,2,3,3
3,3,2,2,1,1
1,1,1,2,2,2
1,1,N,N,N,2
Example input file.

In example input file there are 4 clusterings (partitions) for 6 data points. Every column, separated by commas, indicates the group of data point respect to the order from left. i.e. in first clustering data points (d1, d2 ... d6) are grouped like this: { d1, d2 } , { d3, d4 } , { d5, d6 }.

"N" as group identifier means that corresponding object is not grouped, i.e. is not clustered, in that clustering. So group identifier of this object for this clustering will be ignored. e.g. Identifiers of data points d3, d4 and d5 is N for the 4th clustering of example input file, so for these data points information from only three clusterings is used.

Lines starting with a # are ignored as well.

Input clusterings for Gene Expression data sets can be downloaded here.

Input clusterings for artificial data sets can be downloaded here.

Input clusterings for real data sets can be downloaded here.

Gene Expression Data Sets

Properties of Gene Expresion Data Sets

Filtered gene datasets can be found on the supplementary site of Clustering Cancer Gene Expression Data: a Comparative Study.
Dataset Name Array Type Tissue Total samples Num of classes Total Genes Selected # of Genes
Bladder carcinoma (Dyrskjot et al., 2003) Affymetrix Bladder 40 3 7129 1203
Breast Cancer (West et al., 2001) Affymetrix Breast 49 2 7129 1198
Breast-Colon tumors (Chowdary et al., 2006) Affymetrix Breast, Colon 104 2 22283 182
Carcinomas (Su et al., 2001) Affymetrix Multi-tissue 174 10 12533 1571
Central nervous system-1 (Pomeroy et al., 2002) Affymetrix Brain 34 2 7129 857
Central nervous system-2 (Pomeroy et al., 2002) Affymetrix Brain 42 5 7129 1379
Endometrial cancer (Risinger et al., 2003) Double Channel Endometrium 42 4 8872 1771
Glioblastoma multiforme (Liang et al., 2005) Double Channel Brain 37 3 24192 1411
Gliomagenesis (Bredel et al., 2005) Double Channel Brain 50 3 41472 1739
Gliomas-1 (Nutt et al., 2003) Affymetrix Brain 50 4 12625 1377
Gliomas-2 (Nutt et al., 2003) Affymetrix Brain 28 2 12625 1070
Gliomas-3 (Nutt et al., 2003) Affymetrix Brain 22 2 12625 1152
Hepatocellular carcinoma (Chen et al., 2002) Double Channel Liver 178 2 22699 85
Leukemia-1 (Yeoh et al., 2002) Affymetrix Bone Marrow 248 2 12625 2526
Leukemia-2 (Yeoh et al., 2002) Affymetrix Bone Marrow 248 6 4022 1095
Leukemia-3 (Armstrong et al., 2002) Affymetrix Blood 72 2 12582 1081
Leukemia-4 (Armstrong et al., 2002) Affymetrix Blood 72 3 12582 2194
Leukemia-5 (Golub et al., 1999) Affymetrix Bone Marrow 72 2 7129 1877
Leukemia-6 (Golub et al., 1999) Affymetrix Bone Marrow 72 3 7129 1877
Lung tumor-1 (Bhattacharjee et al., 2001) Affymetrix Lung 203 5 12600 1543
Lung tumor-2 (Garber et al., 2001) Double Channel Lung 66 4 24192 4553
Lymphoma-1 (Alizadeh et al., 2000) Double Channel Blood 42 2 4022 1095
Lymphoma-2 (Alizadeh et al., 2000) Double Channel Blood 62 3 4022 2093
Lymphoma-3 (Shipp et al., 2002) Affymetrix Blood 77 2 7129 798
Melanoma (Bittner et al., 2000) Double Channel Skin 38 2 8067 2201
Mesothelioma (Gordon et al., 2002) Affymetrix Lung 181 2 12533 1626
Multi-tissue (Ramaswamy et al., 2001) Affymetrix Multi-tissue 190 14 16063 1363
Prostate cancer-1 (Tomlins et al., 2007) Double Channel Prostate 104 5 20000 2315
Prostate cancer-2 (Tomlins et al., 2007) Double Channel Prostate 92 4 20000 1288
Prostate cancer-3 (Lapointe et al., 2004) Double Channel Prostate 69 3 42640 1625
Prostate cancer-4 (Lapointe et al., 2004) Double Channel Prostate 110 4 42640 2496
Prostate cancer-5 (Singh et al., 2002) Affymetrix Prostate 102 2 12600 339
Round blue-cell tumor (Khan et al., 2001) Double Channel Multi-tissue 83 4 6567 1069
Serrated carcinomas (Laiho et al., 2007) Affymetrix Colon 37 2 22883 2202

Generating Input Clusterings

Input clusterings for Gene Expression data sets can be downloaded here.

Data set name Method Selected features Number of Clusters in Clusterings Number of Clusterings Min ARI Max ARI Average ARI
Bladder carcinoma k-means 25% - 50% 2 - 6 10 0.18 0.64 0.39
Breast Cancer k-means 25% - 50% 2 - 7 10 0.08 0.42 0.25
Breast-Colon tumors k-means 25% - 50% 2 - 10 10 0.11 0.92 0.43
Carcinomas k-means 25% - 50% 2 - 13 10 0.10 0.63 0.42
Central nervous system-1 manual N/A 2 - 4 6 0.21 0.61 0.44
Central nervous system-2 k-means 25% - 50% 2 - 6 10 0.23 0.50 0.38
Endometrial cancer manual, random N/A 4 - 5 5 0.48 0.71 0.60
Glioblastoma multiforme k-means 75% - 85% 2 - 6 10 -0.03 0.46 0.18
Gliomagenesis k-means 25% - 50% 2 - 7 10 0.11 0.49 0.28
Gliomas-1 manual N/A 4 - 6 4 0.48 0.74 0.64
Gliomas-2 manual, random N/A 2 - 5 4 0.30 0.39 0.36
Gliomas-3 manual N/A 2 - 3 3 0.37 0.61 0.52
Hepatocellular carcinoma k-means 75% - 85% 2 - 13 10 0.10 0.70 0.40
Leukemia-1 k-means 75% - 85% 2 - 15 10 0.10 0.32 0.18
Leukemia-2 k-means 25% - 50% 2 - 15 10 0.14 0.23 0.20
Leukemia-3 manual N/A 2 - 5 3 0.36 0.56 0.46
Leukemia-4 k-means 75% - 85% 3 - 8 10 0.42 0.92 0.59
Leukemia-5 k-means 25% - 50% 2 - 8 10 0.15 0.89 0.45
Leukemia-6 k-means 25% - 50% 2 - 8 10 0.18 0.84 0.47
Lung tumor-1 k-means 25% - 50% 3 - 14 10 0.10 0.24 0.18
Lung tumor-2 k-means 25% - 50% 2 - 8 10 0.08 0.32 0.19
Lymphoma-1 k-means 25% - 50% 2 - 6 10 0.02 0.43 0.17
Lymphoma-2 k-means 25% - 50% 3 - 7 10 0.20 0.52 0.33
Lymphoma-4 k-means 25% - 50% 2 - 8 10 -0.01 0.32 0.11
Melanoma manual, random N/A 2 5 0.38 0.70 0.50
Mesothelioma k-means 25% - 50% 2 - 13 10 0.07 0.75 0.25
Multi-tissue k-means 25% - 50% 2 - 10 10 0.15 0.41 0.31
Prostate cancer-1 manual N/A 5 - 7 5 0.43 0.61 0.52
Prostate cancer-2 manual N/A 4 - 6 5 0.44 0.61 0.52
Prostate cancer-3 manual N/A 4 - 7 4 0.24 0.64 0.42
Prostate cancer-4 manual N/A 5 - 6 3 0.51 0.61 0.55
Prostate cancer-5 k-means 25% - 50% 2 - 10 10 0.02 0.23 0.10
Round blue-cell tumor k-means 25% - 50% 2 - 9 10 0.10 0.90 0.49
Serrated carcinomas manual N/A 2 - 6 5 0.29 0.51 0.37

Results of Ensemble Algorithms

Data set name DiCLENS MCLA CSPA HGPA LCE
Bladder carcinoma 0.59 0.56 0.32 0.36 0.41
Breast Cancer 0.63 0.56 0.44 0.50 0.56
Breast-Colon tumors 0.92 0.85 0.62 0.75 0.92
Carcinomas 0.46 0.48 0.35 0.43 0.57
Central nervous system-1 1.00 0.88 0.33 0.33 1.00
Central nervous system-2 0.54 0.54 0.60 0.51 0.61
Endometrial cancer 1.00 0.92 0.55 0.42 0.92
Glioblastoma multiforme 0.46 0.16 0.13 0.13 0.16
Gliomagenesis 0.55 0.38 0.25 0.34 0.37
Gliomas-1 0.92 0.92 0.73 0.88 0.92
Gliomas-2 0.72 0.60 0.39 0.35 1.00
Gliomas-3 1.00 1.00 0.51 0.27 1.00
Hepatocellular carcinoma 0.72 0.62 0.65 -0.01 0.64
Leukemia-1 0.96 0.15 0.10 0.11 0.96
Leukemia-2 0.38 0.31 0.25 0.26 0.37
Leukemia-3 0.94 0.94 0.44 0.24 0.56
Leukemia-4 0.92 0.92 0.81 0.92 0.92
Leukemia-5 0.89 0.84 0.52 0.48 0.84
Leukemia-6 0.84 0.74 0.54 0.48 0.79
Lung tumor-1 0.55 0.29 0.12 0.11 0.32
Lung tumor-2 0.28 0.25 0.09 0.09 0.15
Lymphoma-1 0.50 0.26 0.37 0.12 0.37
Lymphoma-2 0.83 0.38 0.39 0.35 0.38
Lymphoma-3 0.31 0.20 0.25 0.17 0.25
Melanoma 0.89 0.89 0.89 0.2 0.7
Mesothelioma 0.89 0.78 0.12 0.14 0.78
Multi-tissue 0.28 0.41 0.38 0.28 0.44
Prostate cancer-1 0.89 0.89 0.58 0.52 0.66
Prostate cancer-2 0.90 0.92 0.60 0.47 0.79
Prostate cancer-3 0.65 0.50 0.46 0.27 0.47
Prostate cancer-4 0.78 0.71 0.44 0.32 0.86
Prostate cancer-5 0.13 0.07 0.07 0.05 0.02
Round blue-cell tumor 0.94 0.89 0.55 0.73 0.89
Serrated carcinomas 1.00 0.88 0.20 0.15 1.00

Number of Clusters

Data set name True Cluster # DiCLENS
Bladder carcinoma 3 2
Breast Cancer 2 2
Breast-Colon tumors 2 2
Carcinomas 10 6
Central nervous system-1 2 2
Central nervous system-2 5 4
Endometrial cancer 4 4
Glioblastoma multiforme 3 2
Gliomagenesis 3 2
Gliomas-1 4 4
Gliomas-2 2 2
Gliomas-3 2 2
Hepatocellular carcinoma 2 3
Leukemia-1 2 2
Leukemia-2 6 3
Leukemia-3 2 2
Leukemia-4 3 3
Leukemia-5 2 2
Leukemia-6 3 3
Lung tumor-1 5 3
Lung tumor-2 4 2
Lymphoma-1 2 2
Lymphoma-2 3 2
Lymphoma-3 2 3
Melanoma 2 2
Mesothelioma 2 2
Multi-tissue 14 5
Prostate cancer-1 5 5
Prostate cancer-2 4 5
Prostate cancer-3 3 2
Prostate cancer-4 4 5
Prostate cancer-5 2 6
Round blue-cell tumor 4 5
Serrated carcinomas 2 2

Artificial Data Sets

Properties of Artificial Data Sets

Data Set Number of Real Classes Number of Features Number of Data Points
2curves 2 2 193
2halfrings 2 2 118
4c10k 4 3 10000
4c20k 4 3 20000
4c40k 4 3 40000

Generating Input Clusters

Data set name Method Number of Clusters in Clusterings Number of Clusterings Min ARI Max ARI Average ARI
2curves manual, random 4 - 6 3 0.33 0.78 0.532
2halfrings k-means 2 - 5 3 0.414 0.805 0.656
4c10k k-means 5 - 6 4 0.727 0.818 0.765
4c20k k-means 3 - 6 9 0.57 0.928 0.739
4c40k k-means 10 3 - 6 0.557 0.874 0.759

Input clusterings for artificial data sets can be downloaded here.

Artificial Data Set ARI Results

Data Set DiCLENS MCLA CSPA HGPA LCE
2curves 0.979 0.979 0.979 0.29 0.98
2halfrings 1 1 1 0.578 1
4c10k 0.986 0.79 N/A 0 0.98
4c20k 0.978 0.977 N/A 0 0.98
4c40k 0.799 0.977 N/A 0 0.98

2HALFRINGS Data Set

Click on the figures for larger ones.

DiCLENS Output - ARI: 1.0

DiCLENS Output - ARI: 1.0

Input Clustering #1 - ARI: 0.80

Input Clustering #1 - ARI: 0.80

Input Clustering #1 - ARI: 0.75

Input Clustering #2 - ARI: 0.75

Input Clustering #1 - ARI: 0.41

Input Clustering #3 - ARI: 0.41

2CURVES Data Set

Click on the figures for larger ones.

DiCLENS Output - ARI: 0.98

DiCLENS Output - ARI: 0.98

Input Clustering #1 - ARI:0.49

Input Clustering #1 - ARI:0.49

Input Clustering #2 - ARI: 0.78

Input Clustering #2 - ARI: 0.78

Input Clustering #3 - ARI: 0.33

Input Clustering #3 - ARI: 0.33

Input Clustering #4 - ARI: 0.38

Input Clustering #4 - ARI: 0.38

S4C10K Data Set

Click on the figures for larger ones.

DiCLENS Output - ARI: 0.99

DiCLENS Output - ARI: 0.99

Input Clustering #1 - ARI: 0.76

Input Clustering #1 - ARI: 0.76

Input Clustering #2 - ARI: 0.73

Input Clustering #2 - ARI: 0.73

Input Clustering #3 - ARI: 0.76

Input Clustering #3 - ARI: 0.76

Input Clustering #4 - ARI: 0.82

Input Clustering #4 - ARI: 0.82

S4C20K Data Set

Click on the figures for larger ones.

DiCLENS Output - ARI: 0.98

DiCLENS Output - ARI: 0.98

Input Clustering #1 - ARI: 0.80

Input Clustering #1 - ARI: 0.80

Input Clustering #2 - ARI: 0.81

Input Clustering #2 - ARI: 0.81

Input Clustering #3 - ARI: 0.57

Input Clustering #3 - ARI: 0.57

 

Input Clustering #4 - ARI: 0.72

Input Clustering #4 - ARI: 0.72

Input Clustering #5 - ARI: 0.64

Input Clustering #5 - ARI: 0.64

Input Clustering #6 - ARI: 0.81

Input Clustering #6 - ARI: 0.81

 

Input Clustering #7 - ARI: 0.57

Input Clustering #7 - ARI: 0.57

Input Clustering #8 - ARI: 0.93

Input Clustering #8 - ARI: 0.93

Input Clustering #9 - ARI: 0.80

Input Clustering #9 - ARI: 0.80

 

S4C40K Data Set

Click on the figures for larger ones.

DiCLENS Output - ARI: 0.98

DiCLENS Output - ARI: 0.98

Input Clustering #1 - ARI: 0.81

Input Clustering #1 - ARI: 0.81

Input Clustering #2 - ARI: 0.80

Input Clustering #2 - ARI: 0.80

Input Clustering #3 - ARI: 0.87

Input Clustering #3 - ARI: 0.80

 

Input Clustering #4 - ARI: 0.58

Input Clustering #4 - ARI: 0.87

Input Clustering #5 - ARI: 0.76

Input Clustering #5 - ARI: 0.58

Input Clustering #6 - ARI: 0.80

Input Clustering #6 - ARI: 0.76

 

Input Clustering #7 - ARI: 0.80

Input Clustering #7 - ARI: 0.80

Input Clustering #8 - ARI: 0.80

Input Clustering #8 - ARI: 0.80

Input Clustering #9 - ARI: 0.56

Input Clustering #9 - ARI: 0.56

Input Clustering #10 - ARI: 0.81

Input Clustering #10 - ARI: 0.81

 

Real Data Sets

Real Data Sets

ARI results of cluster ensemble algorithms for the data sets Image Segmentation and Glass, which are obtained from UCI Machine Learning Repository.Input clusterings for these data sets can be downloaded here.

Data Set Number of Real Classes Number of Features Number of Data Points
segmentation 7 19 2310
glass 6 10 214

Generating Input Clusters

Data set name Method Number of Clusters in Clusterings Number of Clusterings Min ARI Max ARI Average ARI
segmentation k-means 7 5 0.436 0.525 0.46
glass k-means 5 - 7 5 0.581 0.731 0.642

Real Data Set ARI Results

Data Set DiCLENS MCLA CSPA HGPA LCE
segmentation 0.923 0.922 0.912 0.621 0.89
glass 1 0.996 0.512 0.207 0.73