Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, and Saku Sugawara

TL;DR
This paper examines how effective automatically generated checklists are for evaluating generative language tasks. It finds that selective checklist use tends to improve pairwise evaluation, that the benefits are less consistent in direct scoring, and that human evaluation itself may be inconsistent.
Contribution
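It systematically evaluates six checklist generation methods across eight model sizes, comparing checklist use for all questions against selective use and measuring the effect on automatic evaluation in pairwise comparison and direct scoring tasks.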
Findings
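Selective checklist use tends to improve evaluation performance in pairwise comparison.
Even checklist items with low correlation to human scores often reflect human-written criteria, which points to possible inconsistencies in human evaluation.
The benefits of checklists are less consistent in direct scoring tasks.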
Abstract
Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings…
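To make the item-level analysis mentioned in the abstract concrete, the following is a minimal sketch of how one might rank checklist items by their correlation with human scores. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say which correlation measure is used, so Pearson correlation is swapped in here, and the function names (`pearson`, `rank_items`), the binary item judgments, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant sequence has no defined correlation; report 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_items(item_judgments, human_scores):
    """Sort checklist items by correlation with human scores, highest first.

    item_judgments maps each checklist item (hypothetical text) to a list of
    0/1 values, one per response: 1 if the response satisfies the item.
    """
    scored = {
        item: pearson(judgments, human_scores)
        for item, judgments in item_judgments.items()
    }
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only (not from the paper).
human_scores = [1, 2, 3, 4, 5]
item_judgments = {
    "Answers the question directly": [0, 0, 1, 1, 1],
    "Uses a formal tone": [1, 0, 1, 0, 1],
}

ranking = rank_items(item_judgments, human_scores)
for item, r in ranking:
    print(f"{r:+.2f}  {item}")
```

In this toy setup, an item that tracks the human scores ranks above one that is unrelated to them. A low-ranked item is not necessarily a bad criterion: as the abstract notes, such items often still reflect human-written criteria.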
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Mental Health via Writing
