Document Clustering Evaluation: Divergence from a Random Baseline
Christopher M. De Vries, Shlomo Geva, Andrew Trotman

TL;DR
This paper introduces divergence from a random baseline, a technique for evaluating document clustering quality that normalizes quality measures and distinguishes effective from ineffective clusterings under both intrinsic and extrinsic evaluation.
Contribution
It presents a novel evaluation approach, divergence from a random baseline, that can be applied to any cluster quality measure and keeps clusterings that provide no useful result from receiving high scores.
Findings
Differentiates the ineffective clusterings encountered in the INEX XML Mining track
Appears to normalize cluster quality measures in a way similar to NMI, while applying to any measure
Provides clear optima when applied to distortion as measured by RMSE
Abstract
Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing useful work, so that ineffective clusterings that provide no useful result cannot obtain high scores. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters-to-categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a…
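The idea can be illustrated with a minimal sketch. It rests on assumptions that the excerpt above does not spell out: the random baseline is built by shuffling document-to-cluster assignments so that cluster sizes are preserved, the divergence is the plain difference between the real score and the mean baseline score, and purity stands in for "any measure of cluster quality". The function names, the `trials` parameter, and the `seed` parameter are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that fall in the majority category of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def divergence_from_random_baseline(clusters, labels, measure=purity,
                                    trials=20, seed=0):
    """Score of the real clustering minus the mean score of random baselines.

    Assumption: each baseline keeps the cluster size distribution of the
    clustering under evaluation and only randomises which documents land
    in which cluster.
    """
    rng = random.Random(seed)
    real = measure(clusters, labels)
    total = 0.0
    for _ in range(trials):
        shuffled = list(clusters)
        rng.shuffle(shuffled)  # same cluster sizes, random membership
        total += measure(shuffled, labels)
    return real - total / trials


labels = ["a", "a", "a", "b", "b", "b"]
good = [0, 0, 0, 1, 1, 1]        # matches the categories
singletons = [0, 1, 2, 3, 4, 5]  # one document per cluster
one_big = [0, 0, 0, 0, 0, 0]     # everything in one cluster

print(purity(singletons, labels))                           # 1.0 despite being useless
print(divergence_from_random_baseline(singletons, labels))  # 0.0
print(divergence_from_random_baseline(one_big, labels))     # 0.0
print(divergence_from_random_baseline(good, labels))        # positive
```

In this sketch the singleton clustering gets a perfect purity score, yet its divergence is zero because every size-preserving shuffle scores the same. Only the clustering that actually groups documents by category diverges from its baseline.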
Taxonomy
Topics: Advanced Clustering Algorithms Research · Text and Document Classification Technologies · Data Mining Algorithms and Applications
