TL;DR
This paper introduces $T^5Score$, an evaluation methodology for assessing the quality of topic sets that LLMs extract from multiple documents. It targets the low inter-annotator agreement of existing evaluation practices and supports reliable manual or automatic assessment.
Contribution
The paper presents $T^5Score$, a novel, decompositional evaluation framework for LLM-generated topics that improves reliability and inter-annotator agreement.
Findings
$T^5Score$ achieves high inter-annotator agreement.
It effectively decomposes topic quality into measurable aspects.
Experiments on multiple datasets support the methodology and its claims.
Abstract
Using LLMs for Multi-Document Topic Extraction has recently gained popularity due to their apparent high-quality outputs, expressiveness, and ease of use. However, most existing evaluation practices are not designed for LLM-generated topics and result in low inter-annotator agreement scores, hindering the reliable use of LLMs for the task. To address this, we introduce $T^5Score$, an evaluation methodology that decomposes the quality of a topic set into quantifiable aspects, measurable through easy-to-perform annotation tasks. This framing enables a convenient evaluation procedure, manual or automatic, that results in a strong inter-annotator agreement score. To substantiate our methodology and claims, we perform extensive experimentation on multiple datasets and report the results.
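The abstract does not say which agreement statistic is reported, so the following is only an illustrative sketch of how inter-annotator agreement on a simple annotation task could be measured. It assumes two annotators, categorical labels, and Cohen's kappa as the statistic. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    Assumption: this statistic is chosen for illustration only; the
    paper's abstract does not name the agreement measure it uses.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labeled independently with their own label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary judgments (1 = "topic is acceptable", 0 = "not acceptable").
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why narrow, easy-to-perform annotation tasks tend to be judged by it rather than by raw percent agreement alone.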
Taxonomy
Methods: Sparse Evolutionary Training
