TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation
Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, Alex Wang

TL;DR
This paper introduces TICK, an automated, interpretable evaluation method using LLM-generated checklists to assess and improve LLM performance, leading to better alignment with human judgments and enhanced generation quality.
Contribution
The work presents TICK and STICK, novel LLM-based checklist generation and self-refinement techniques that improve evaluation accuracy and response quality in a fully automated manner.
Findings
TICK increases agreement between LLM and human preferences from 46.4% to 52.2%.
STICK improves generation quality, with absolute gains of +7.8% (self-refinement on LiveBench reasoning tasks) and +6.3% (Best-of-N selection on WildBench).
Checklists enhance inter-annotator agreement from 0.194 to 0.256.
Abstract
Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each…
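The checklist idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract is truncated before it says how answers are aggregated, so scoring a response by the fraction of questions answered YES is an assumption here. The names `checklist_pass_rate`, `toy_judge`, and `TOY_CHECKS` are hypothetical, and the toy judge uses simple string checks in place of the LLM judge that TICK would use.

```python
from typing import Callable, Dict, List

# A judge answers one YES/NO checklist question about a response.
# In TICK this role is played by an LLM; here it is any callable (assumption).
Judge = Callable[[str, str, str], bool]


def checklist_pass_rate(
    instruction: str,
    response: str,
    checklist: List[str],
    judge: Judge,
) -> float:
    """Return the fraction of checklist questions answered YES.

    Aggregating by pass rate is an assumption made for this sketch.
    """
    if not checklist:
        raise ValueError("checklist must contain at least one question")
    answers = [judge(instruction, response, question) for question in checklist]
    return sum(answers) / len(answers)


# Hypothetical stand-in for an LLM judge: each question maps to a string check.
TOY_CHECKS: Dict[str, Callable[[str], bool]] = {
    "Does the response mention Paris?": lambda r: "Paris" in r,
    "Is the response at most 20 words long?": lambda r: len(r.split()) <= 20,
    "Does the response end with a question?": lambda r: r.strip().endswith("?"),
}


def toy_judge(instruction: str, response: str, question: str) -> bool:
    # The instruction is unused by this toy judge; a real judge would read it.
    return TOY_CHECKS[question](response)


instruction = (
    "In at most 20 words, name the capital of France and end with a question."
)
response = "The capital of France is Paris."
score = checklist_pass_rate(instruction, response, list(TOY_CHECKS), toy_judge)
print(f"{score:.2f}")  # prints 0.67: two of the three checks pass
```

Because each question is answered separately, the per-question answers show which requirement a response missed (here, the closing question), which a single preference ranking would not reveal.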
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Topic Modeling
