Capturing LLM Capabilities via Evidence-Calibrated Query Clustering
Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang

TL;DR
This paper introduces ECC, a clustering algorithm that calibrates semantic embeddings with a limited number of model comparisons so that query clusters reflect latent LLM capability demands, improving capability evaluation and query routing.
Contribution
ECC is a new capability-aware clustering method that aligns semantic embeddings with model performance, enabling more accurate LLM capability assessment.
Findings
ECC outperforms human-labeled baselines by 17.64 percentage points.
ECC surpasses embedding-based baselines by 18.02 percentage points.
ECC improves LLM capability ranking and query routing tasks.
Abstract
Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive…
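To make the abstract's description concrete, here is a minimal sketch of how a mixture of Bradley-Terry capability profiles could yield a query-specific win probability between two models. It is an illustration under stated assumptions, not the authors' implementation: the function names, the softmax parameterization of the mixture weights, and the toy strength values are all hypothetical, and the step that calibrates semantic embeddings is not shown.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into mixture weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bt_win_prob(strengths, i, j):
    """Bradley-Terry probability that model i beats model j,
    given one cluster's per-model strength scores."""
    return 1.0 / (1.0 + math.exp(-(strengths[i] - strengths[j])))


def mixture_win_prob(mix_logits, cluster_strengths, i, j):
    """Query-specific win probability: a weighted average of the
    per-cluster Bradley-Terry probabilities (assumed mixture form)."""
    weights = softmax(mix_logits)
    return sum(
        w * bt_win_prob(strengths, i, j)
        for w, strengths in zip(weights, cluster_strengths)
    )


# Hypothetical toy setup: two clusters, two models.
# Cluster 0 favors model 0; cluster 1 favors model 1.
cluster_strengths = [[1.0, 0.0], [0.0, 1.0]]

# A query split evenly between the two clusters.
p_even = mixture_win_prob([0.0, 0.0], cluster_strengths, 0, 1)

# A query that leans strongly toward cluster 0.
p_lean = mixture_win_prob([3.0, 0.0], cluster_strengths, 0, 1)
```

In this toy setup the evenly mixed query gives both models the same chance of winning, while the query weighted toward cluster 0 favors model 0. That is the sense in which mixture weights let a single query draw on several capability profiles.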
