TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

Jonathan Cook; Tim Rocktäschel; Jakob Foerster; Dennis Aumiller; Alex Wang

arXiv:2410.03608 · cs.AI · October 7, 2024

TL;DR

This paper introduces TICK, an automated, interpretable evaluation method using LLM-generated checklists to assess and improve LLM performance, leading to better alignment with human judgments and enhanced generation quality.

Contribution

The work presents TICK and STICK, novel LLM-based checklist generation and self-refinement techniques that improve evaluation accuracy and response quality in a fully automated manner.

Findings

1. TICK increases agreement between LLM and human preferences from 46.4% to 52.2%.

2. STICK improves generation quality, with gains of +7.8% and +6.3% on two benchmarks.

3. Checklists raise inter-annotator agreement from 0.194 to 0.256.

Abstract

Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each…
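The checklist idea in the abstract can be illustrated with a minimal sketch. It assumes that a response is scored by the fraction of checklist questions answered YES and that the higher-scoring response is preferred; the abstract as shown here is truncated before it states the exact aggregation, so this rule is an assumption. The names `checklist_pass_rate`, `prefer`, `toy_judge`, and `CHECKS` are hypothetical, and the toy judge is a rule-based stand-in for the LLM judge a real setup would call.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (instruction, response, question) -> True for YES, False for NO.
# In practice this would wrap an LLM call; here it is left abstract.
Judge = Callable[[str, str, str], bool]


def checklist_pass_rate(instruction: str, response: str,
                        checklist: List[str], judge: Judge) -> float:
    """Fraction of checklist questions answered YES (assumed aggregation rule)."""
    if not checklist:
        raise ValueError("checklist must contain at least one question")
    answers = [judge(instruction, response, question) for question in checklist]
    return sum(answers) / len(answers)


def prefer(instruction: str, response_a: str, response_b: str,
           checklist: List[str], judge: Judge) -> str:
    """Return 'A', 'B', or 'tie' by comparing the two pass rates."""
    rate_a = checklist_pass_rate(instruction, response_a, checklist, judge)
    rate_b = checklist_pass_rate(instruction, response_b, checklist, judge)
    if rate_a > rate_b:
        return "A"
    if rate_b > rate_a:
        return "B"
    return "tie"


# Toy stand-in for an LLM judge: each question maps to a simple rule on the response.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "Does the response have exactly three lines?":
        lambda r: len(r.strip().splitlines()) == 3,
    "Does the response mention autumn?":
        lambda r: "autumn" in r.lower(),
}


def toy_judge(instruction: str, response: str, question: str) -> bool:
    return CHECKS[question](response)


instruction = "Write a three-line poem about autumn."
checklist = list(CHECKS)
response_a = "Red leaves drift and fall\nAutumn wind hums through the pines\nCold light on the pond"
response_b = "Leaves fall in the cold."

rate_a = checklist_pass_rate(instruction, response_a, checklist, toy_judge)
rate_b = checklist_pass_rate(instruction, response_b, checklist, toy_judge)
winner = prefer(instruction, response_a, response_b, checklist, toy_judge)
```

Because each question is answered separately, the per-question answers remain available for inspection, which is the interpretability benefit the abstract points to when it contrasts checklists with a single preference ranking.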

Taxonomy

Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Topic Modeling