Fast $k$-means Seeding Under The Manifold Hypothesis
Poojan Shah, Shashwat Agrawal, Ragesh Jaiswal

TL;DR
This paper introduces a new seeding method for $k$-means clustering, leveraging the manifold hypothesis to achieve faster runtimes with predictable approximation guarantees, validated through extensive empirical testing.
Contribution
It proposes $\text{Qkmeans}$, a novel seeding algorithm that exploits geometric properties of data on low-dimensional manifolds for improved efficiency and theoretical guarantees.
Findings
$\text{Qkmeans}$ achieves an $O(\rho^{-2} \log k)$ approximation.
The algorithm runs in $O(nD) + \tilde{O}(\epsilon^{1+\rho} \rho^{-1} k^{1+\gamma})$ time.
Empirical results validate theoretical predictions across various domains.
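To make "seeding" and "approximation" concrete, the sketch below computes the $k$-means cost and runs the standard $k$-means++ ($D^2$-sampling) seeding. This is a well-known baseline, not the paper's $\text{Qkmeans}$, whose details are not given in this summary. The function names `kmeans_cost` and `d2_seeding` and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    # Pairwise squared distances, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def d2_seeding(X, k, rng):
    """Standard k-means++ (D^2) seeding; NOT the paper's Qkmeans."""
    n = X.shape[0]
    idx = [int(rng.integers(n))]               # first center: uniform at random
    d2 = ((X - X[idx[0]]) ** 2).sum(axis=1)    # squared distance to nearest chosen center
    for _ in range(1, k):
        # Sample the next center with probability proportional to d2.
        nxt = int(rng.choice(n, p=d2 / d2.sum()))
        idx.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[idx]

rng = np.random.default_rng(0)
# Hypothetical toy data: three well-separated blobs in the plane.
offsets = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.concatenate([rng.normal(size=(100, 2)) + o for o in offsets])
centers = d2_seeding(X, k=3, rng=rng)
seed_cost = kmeans_cost(X, centers)
```

A seeding is an $\alpha$-approximation when its cost is at most $\alpha$ times the optimal $k$-means cost; $k$-means++ is known to achieve $O(\log k)$ in expectation. Each round of this baseline scans all $n$ points, so it takes $O(nkD)$ time, which is the kind of cost the runtime bound stated above aims to improve on.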
Abstract
We study beyond-worst-case analysis for the $k$-means problem, where the goal is to model typical instances of $k$-means arising in practice. Existing theoretical approaches provide guarantees under certain assumptions on the optimal solutions to $k$-means, making them difficult to validate in practice. We propose the manifold hypothesis, where data obtained in ambient dimension $D$ concentrates around a manifold of low intrinsic dimension, as a reasonable assumption to model real-world clustering instances. We identify key geometric properties of datasets which have theoretically predictable scaling laws depending on the quantization exponent, using techniques from optimum quantization theory. We show how to exploit these regularities to design a fast seeding method called $\text{Qkmeans}$ which provides approximate…
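As a rough illustration of the manifold hypothesis, the sketch below builds a synthetic dataset that lives in a high ambient dimension but concentrates near a one-dimensional curve (a circle). The dataset, the noise level, and the names `ambient_dim` and `top2_energy` are assumptions for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ambient_dim = 1000, 50

# Points on a unit circle: intrinsic dimension 1, parameterized by an angle.
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)    # shape (n, 2)

# Embed the circle into R^50 with a random orthonormal map, then add small noise.
Q, _ = np.linalg.qr(rng.normal(size=(ambient_dim, 2)))       # Q has shape (50, 2)
X = circle @ Q.T + 0.005 * rng.normal(size=(n, ambient_dim))

# Fraction of total variance captured by the top two principal directions.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
top2_energy = float((s[:2] ** 2).sum() / (s ** 2).sum())
```

Almost all of the variance sits in two of the fifty directions, even though every point has fifty coordinates. A linear check like this only bounds the intrinsic dimension from above (the circle itself is one-dimensional), so it is a sanity check rather than an estimator of intrinsic dimension.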
Taxonomy
Topics: Advanced Clustering Algorithms Research · Facility Location and Emergency Management · Stochastic Gradient Optimization Techniques
