Nicole McNabb, Master’s candidate
David R. Cheriton School of Computer Science
Clustering is the task of partitioning data so that “similar” points are grouped together and “dissimilar” ones are separated. In general, this task is ill-defined. One way to make clustering well-defined is to introduce a clustering objective to optimize. While optimizing many common objectives, such as k-means, is known to be NP-hard, heuristics produce “nice” clustering solutions efficiently in practice. This work analyzes two avenues of theoretical research that attempt to explain this discrepancy.
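For concreteness, the following is a minimal sketch of Lloyd's algorithm, the classical heuristic for the k-means objective: each iteration is cheap, but the method is only guaranteed to reach a local optimum. The NumPy-based implementation, the seeded initialization, and the iteration cap are illustrative choices, not details taken from this work.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's heuristic for the k-means objective on Euclidean data X (n x d).

    Each iteration is linear in the number of points (for fixed k and
    dimension), but the method carries no worst-case guarantee on the
    objective value: it only converges to a local optimum.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster becomes empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # reached a local optimum
        centers = new_centers
    return centers, labels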
The first avenue studies the hardness of approximating common clustering objectives. We review state-of-the-art approximation algorithms and hardness-of-approximation results for three common objectives: k-means, k-median, and k-center, and we analyze the extent to which these results explain the discrepancy between worst-case hardness of optimization and practical efficiency.
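For reference, these objectives have standard formulations. With X a finite set of points in a metric space with distance d, and C a set of k centers (the notation here is a common convention, not one fixed by this work), they ask to minimize, respectively,
\[
\begin{aligned}
k\text{-means:}  &\quad \sum_{x \in X} \min_{c \in C} d(x,c)^{2}, \\
k\text{-median:} &\quad \sum_{x \in X} \min_{c \in C} d(x,c), \\
k\text{-center:} &\quad \max_{x \in X} \min_{c \in C} d(x,c).
\end{aligned}
\]
As an example of the known landscape, k-center admits a simple 2-approximation in general metrics, while approximating it within any factor smaller than 2 is NP-hard; for k-means and k-median, the best known approximation factors and hardness thresholds do not match.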
The second avenue of research studies structural properties under which data can be clustered efficiently. The goal is to show that, for some rigorous notion of structure, otherwise NP-hard objectives can be optimized efficiently on data exhibiting that structure. Most notions found to satisfy this goal, however, have proven unrealistic for real data. In this work, we review three structural notions that yield significant results for k-means, k-median, and k-center. We examine the extent to which these notions apply to real data and the implications of recent computational complexity results. Finally, we propose a way to combine the two avenues of research to better explain the theory-practice gap in objective clustering.
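As one well-known example of such a notion of structure (stated here only for concreteness, and not necessarily one of the three notions reviewed in this work), an instance is called α-perturbation resilient for some α ≥ 1 if its optimal clustering under the chosen objective is unchanged by any perturbation of the distances by a factor of at most α:
\[
d(x,y) \;\le\; d'(x,y) \;\le\; \alpha\, d(x,y) \ \text{ for all } x, y
\quad \Longrightarrow \quad
\mathcal{C}^{*}_{d'} = \mathcal{C}^{*}_{d},
\]
where \(\mathcal{C}^{*}_{d}\) denotes the optimal clustering of the instance under distance function d.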