Contributions to Clustering and Evaluating Clustering Quality

Speaker: Joshua Yee

Committee Members: Prof. Dan Simovici (Chair), Prof. Ming Ouyang, Prof. Eric Grinberg,Prof. Nurit Haspel

GPD: Prof. Dan Simovici

Date: January 19th, 2024 (Friday) at 11:00 AM ET

Zoom Link: [](

Passcode: 612010


Clustering describes the machine-learning task of trying to group objects in a data set such that similar objects belong to the same group. Methods of clustering, referred to as clustering algorithms, fall into the unsupervised learning category of tasks. This means that the machine attempting to group objects is not privy to the true class to which the object belongs, and must instead infer an object's class based on other information. Clustering algorithms are well-poised for application in data mining, in which we are given a data set and look to determine patterns within. This dissertation examines various ways that mathematical principles can be applied to clustering to produce effective algorithms. The first section of this dissertation explores the use of Rymon trees in biclustering. Biclustering is a type of clustering often used in analyzing gene expression data. This type of data displays unique characteristics such as row-to-column imbalance, feature value range, and feature representation that encourage the use of specialized clustering algorithms. We propose an algorithm using Rymon trees to iteratively construct suitable biclusters on binary data in an efficient manner. In the second section, we introduce the concept of an ultrametric and discuss its application to detecting outliers within a data set. We develop a method to transform a data set equipped with a standard metric into an ultrametric data space, and examine the impact this transformation has on the outliers of clusters in terms of the silhouette score. Using the behavior of outliers under ultrametric transformation, we offer a novel characterization and detection criterion for outliers. The problem of determining the natural number of clusters within an unlabeled data set is explored in the third section. We approach this problem by first introducing a mathematical formalization of hierarchical clustering. We use this formalization to re-frame the natural cluster number problem as a dual criteria optimization problem instead. A new algorithm for detecting the natural number of clusters is then proposed, using the solution of the dual criteria problem as a basis. In the final section, we create a variant of the classical notion of partition entropy into a new function sensitive to the metric structure of a data set. The mathematical properties of the entropy extension are investigated, and we present an application of the new entropy function to the problem of cluster validation.