Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
Alexander Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan Boyd-Graber, Philip Resnik

TL;DR
This paper questions the validity of automated topic coherence metrics by comparing them with human judgments and analyzing their consistency across classical and neural models.
Contribution
It highlights the validation gap in automated coherence measures for neural models and systematically evaluates models to reveal discrepancies with human assessments.
Findings
Automated coherence often disagrees with human judgments.
Neural models outperform classical models on automated metrics but not necessarily on human evaluations.
There is a significant standardization gap in topic model benchmarking.
Abstract
Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Contemporary neural topic models surpass classical ones according to these metrics. At the same time, topic model evaluation suffers from a validation gap: automated coherence, developed for classical models, has not been validated using human experimentation for neural models. In addition, a meta-analysis of topic modeling literature reveals a substantial standardization gap in automated topic modeling benchmarks. To address the validation gap, we compare automated coherence with the two most widely accepted human judgment tasks: topic rating and word intrusion. To address the standardization gap, we systematically evaluate a dominant classical…
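To make "frequency of word co-occurrences in a reference corpus" concrete, here is a minimal sketch of one commonly used metric of this kind, normalized pointwise mutual information (NPMI), averaged over all word pairs in a topic. The text above does not say which metric variants or settings the paper uses, so everything specific here is an assumption for illustration: the function name `npmi_coherence`, document-level (rather than sliding-window) co-occurrence, and the convention of scoring a pair that never co-occurs as -1.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, reference_docs):
    """Average NPMI over all word pairs of a topic.

    Illustrative sketch only: co-occurrence is counted at the document
    level, and each document is a list of tokens. Both are assumptions.
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words to form a pair")
    if not reference_docs:
        raise ValueError("reference corpus is empty")

    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # Assumed convention: words that never co-occur get the minimum.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both words appear in every document: PMI is 0 and the
            # normalizer is 0, so we treat the pair as neutral.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy reference corpus (hypothetical data, not from the paper).
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]
coherent = npmi_coherence(["cat", "dog"], docs)
incoherent = npmi_coherence(["cat", "stock"], docs)
```

On this toy corpus the pair that always appears together scores higher than the pair that never does. A score like this depends only on corpus statistics, which is why it can diverge from what human raters consider a coherent topic.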
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)
