Self-critiquing models for assisting human evaluators
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike

TL;DR
This paper shows that large language models fine-tuned to write critiques can help human evaluators find flaws in summaries they would otherwise miss, and can refine their own outputs using self-critiques. Larger models write more helpful critiques.
Contribution
It introduces a method for fine-tuning models to produce critiques that aid human evaluation and explores the scaling properties and self-critique capabilities of these models.
Findings
Larger models generate more helpful critiques.
Models improve their summaries by integrating self-critiques.
Critiquing ability correlates with, but is distinct from, generation and discrimination abilities.
Abstract
We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation…
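The critique-then-refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The callables `generate`, `critique`, `refine`, and `score` are hypothetical stand-ins for calls to a fine-tuned model. The final selection step, which keeps the draft as a fallback and picks the highest-scoring candidate, is an assumption added for illustration.

```python
from typing import Callable, List

# Hypothetical interfaces: each stands in for a call to a fine-tuned model.
Generate = Callable[[str], str]               # question -> draft summary
Critique = Callable[[str, str], List[str]]    # (question, draft) -> self-critiques
Refine = Callable[[str, str, str], str]       # (question, draft, critique) -> revision
Score = Callable[[str, str], float]           # (question, summary) -> quality estimate


def critique_and_refine(
    question: str,
    generate: Generate,
    critique: Critique,
    refine: Refine,
    score: Score,
) -> str:
    """Draft an answer, self-critique it, and return the best candidate.

    Assumption: the original draft stays in the candidate pool, so a
    refinement is only chosen if the scorer rates it higher than the draft.
    """
    draft = generate(question)
    candidates = [draft]
    # Condition one refinement on each self-critique the model produced.
    for comment in critique(question, draft):
        candidates.append(refine(question, draft, comment))
    # max() returns the first candidate with the highest score.
    return max(candidates, key=lambda summary: score(question, summary))
```

With toy stand-ins (a scorer that prefers longer text), the loop returns whichever refinement the scorer rates highest. If the model produces no critiques, the loop returns the original draft.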
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
