Metric $k$-clustering using only Weak Comparison Oracles

Rahul Raychaudhury; Aryan Esmailpour; Sainyam Galhotra; Stavros Sintos

arXiv:2601.19333·cs.LG·January 28, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces randomized algorithms for $k$-clustering that operate solely on weak comparison oracles, replacing exact distance queries with noisy relative comparisons, and achieves near-optimal clustering with low query complexity.

Contribution

It develops the first scalable clustering algorithms that use only noisy quadruplet comparison oracles, extending applicability to scenarios with limited or noisy distance information.

Findings

01

Achieves constant-factor approximation with $O(nk \cdot \text{polylog}(n))$ queries.

02

Improves to a $(1+\varepsilon)$-approximation, for small $\varepsilon$, in metrics of bounded doubling dimension.

03

Demonstrates integration of noisy, low-cost oracles like language models into clustering algorithms.

Abstract

Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances -- an unrealistic requirement in many modern applications. We study clustering in the Rank-model (R-model), where access to distances is entirely replaced by a quadruplet oracle that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \text{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most constant times the optimum $k$-clustering…
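To make the access model concrete, here is a minimal Python sketch of a quadruplet oracle and of a procedure that touches the data only through it. Everything in it is an illustrative assumption rather than the paper's construction: the class name `NoisyQuadrupletOracle`, the helper `assign_to_centers`, the Euclidean ground-truth metric, and the noise model (each answer flipped independently with probability `noise`) are hypothetical, and the paper's noise model and algorithms may differ.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


class NoisyQuadrupletOracle:
    """Answers "is d(a, b) <= d(c, d)?" without ever revealing a distance.

    Assumed noise model (not taken from the paper): every answer is
    flipped independently with probability `noise`.
    """

    def __init__(self, points: Sequence[Point], noise: float = 0.1, seed: int = 0):
        self._points = list(points)
        self._noise = noise
        self._rng = random.Random(seed)
        self.queries = 0  # number of oracle calls, i.e. the access cost

    def __len__(self) -> int:
        return len(self._points)

    def closer(self, a: int, b: int, c: int, d: int) -> bool:
        """Return the (possibly wrong) answer to "d(a, b) <= d(c, d)?"."""
        self.queries += 1
        truth = math.dist(self._points[a], self._points[b]) <= math.dist(
            self._points[c], self._points[d]
        )
        flip = self._rng.random() < self._noise
        return (not truth) if flip else truth


def assign_to_centers(oracle: NoisyQuadrupletOracle, centers: List[int]) -> List[int]:
    """Map every item to a center using comparisons only.

    A naive linear scan over the centers: one query per (item, center)
    pair beyond the first center. It is not robust to noise; it only
    shows that an item-to-center mapping needs no numeric distances.
    """
    assignment = []
    for x in range(len(oracle)):
        best = centers[0]
        for c in centers[1:]:
            if oracle.closer(x, c, x, best):
                best = c
        assignment.append(best)
    return assignment
```

With `noise=0.0` the scan reproduces the exact nearest-center assignment using $n(k-1)$ queries for $k$ given centers. With noisy answers a single wrong comparison can misassign an item, which is the difficulty a noise-tolerant algorithm has to handle.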

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The problem setting is very natural: in cases where distance computation might be hard or expensive, performing comparisons using machine learning models might be efficient. The proofs are clear to the best of my knowledge. This work removes the requirement of a distance oracle and provides a coreset construction using only the oracle queries. The results obtained are near-optimal.

Weaknesses

The experimental setting is very limited, with experiments run on only one synthetic dataset. But the algorithmic contributions and results obtained are non-trivial and a significant contribution, so this is not really a major concern for me. Maybe a minor typo - probsort (in Lemma 2.1) requires noise parameter $\leq 1/4$ but appendix A says $\leq 1/2$.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Overall, I like this paper. It follows the recent line of work for weak-strong oracle models. However, different from existing results that almost exclusively showed the necessity of the strong oracle, this paper is the first to show that we could do something without the strong oracle at all (as far as I know). I think this is a nice conceptual contribution. The paper is well-written, and although I did not get the time to check all the steps, the technical overview is easy to follow. Therefore

Weaknesses

I do not see any major weakness in the paper. One thing I want the author to emphasize is that one can know the coreset without knowing the clustering, since I was confused for a moment about whether there is a contradiction with GRS [PODS’24]. Some of the technical overview is a bit wordy and dense with math. I understand it’s a bit hard to present it in a cleaner manner. Maybe you can add some figures for the guard and kernel sets, plus the filtering process. I believe that’ll help readers u

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper is well-written and easy to follow. The weak-oracle model is well motivated, and oracle-based models in general have gained significant attention in the last few years.

Weaknesses

While it is clear why ALG-DI requires bounded doubling dimension, it is not entirely clear why ALG-D cannot be used in the general setting. Minor technical comments (see Detailed technical overview).

Taxonomy

Topics: Advanced Clustering Algorithms Research · Stochastic Gradient Optimization Techniques · Facility Location and Emergency Management