Composite Silhouette: A Subsampling-based Aggregation Strategy
Aggelos Semoglou, Aristidis Likas, John Pavlopoulos

TL;DR
The paper introduces Composite Silhouette, a new internal validation metric that combines micro- and macro-averaged Silhouette scores through subsampling to improve cluster number estimation.
Contribution
It proposes a subsampling-based aggregation strategy that balances micro- and macro-averaging biases for better cluster count determination.
Findings
Composite Silhouette outperforms traditional methods on synthetic datasets.
The method provides finite-sample guarantees for its estimates.
Experiments show improved accuracy on real-world data.
Abstract
Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We…
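The aggregation described in the abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' implementation: the abstract does not give the exact form of the normalized discrepancy or of the bounded nonlinearity, so `adaptive_weight` below assumes a symmetric normalization and uses `tanh` as a stand-in. The function names, the `cluster_fn` callback (any routine that returns one label per subsampled point), and the defaults `n_subsamples=20` and `frac=0.8` are all hypothetical.

```python
import math
import random
from statistics import fmean


def silhouette_values(points, labels):
    """Per-point Silhouette s(i) = (b - a) / max(a, b), Euclidean distance.

    Requires at least two distinct labels.
    """
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    values = []
    for i, c in enumerate(labels):
        own = [j for j in clusters[c] if j != i]
        if not own:  # singleton cluster: s(i) = 0 by convention
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = fmean(math.dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster
        b = min(fmean(math.dist(points[i], points[j]) for j in members)
                for other, members in clusters.items() if other != c)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


def micro_macro(values, labels):
    """Micro average weights points equally; macro weights clusters equally."""
    by_cluster = {}
    for v, c in zip(values, labels):
        by_cluster.setdefault(c, []).append(v)
    micro = fmean(values)
    macro = fmean(fmean(vs) for vs in by_cluster.values())
    return micro, macro


def adaptive_weight(micro, macro, eps=1e-12):
    """Weight in (0, 1) placed on the macro score (ASSUMED form)."""
    # Assumption: discrepancy normalized by the combined magnitude, in [-1, 1].
    d = (macro - micro) / (abs(macro) + abs(micro) + eps)
    # Assumption: tanh plays the role of the bounded nonlinearity.
    return 0.5 * (1.0 + math.tanh(d))


def composite_silhouette(points, cluster_fn, n_subsamples=20, frac=0.8, seed=0):
    """Average the per-subsample convex combinations of micro and macro scores."""
    rng = random.Random(seed)
    m = max(2, int(frac * len(points)))
    composites = []
    for _ in range(n_subsamples):
        sample = rng.sample(points, m)   # subsample without replacement
        labels = cluster_fn(sample)      # re-cluster each subsample
        micro, macro = micro_macro(silhouette_values(sample, labels), labels)
        w = adaptive_weight(micro, macro)
        composites.append(w * macro + (1.0 - w) * micro)
    return fmean(composites)
```

To select a cluster count under this sketch, one would call `composite_silhouette` once per candidate k, passing a `cluster_fn` that runs the chosen clustering algorithm with that k, and keep the k with the highest score. Because each composite is a convex combination, the result always lies between the micro- and macro-averaged values observed on the subsamples.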
