Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Samuel Arnesen, David Rein, Julian Michael

TL;DR
This paper shows that training language models to win debates via self-play makes a language-model judge more accurate at answering long-context reading comprehension questions. Training consultancy models, which persuade a judge with no opposing debater present, yields no such improvement.
Contribution
It trains debate models on data generated via self-play and compares them against novel consultancy baselines, finding that optimizing debaters to win improves judge accuracy and encourages stronger, more informative arguments.
Findings
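Judges answer questions more accurately when evaluating models optimized to win debates; no such relationship holds for consultancy models.
Debate training encourages stronger and more informative arguments.
Debate shows promise for providing high-quality supervision on tasks that are difficult to evaluate directly.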
Abstract
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.
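The two protocols contrasted in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Question`, `debate_round`, `consultancy_round`, `judge_accuracy`, `toy_arguer`, and `toy_judge` are all hypothetical, and the toy arguer and judge are hand-written stubs standing in for trained language models. The stubs are rigged so that only the correct side can cite evidence, so any accuracy they produce says nothing about the paper's results.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Question:
    text: str
    answers: Sequence[str]  # candidate answers
    correct: int            # index of the correct answer


Speech = Tuple[int, str]                       # (answer index defended, argument)
Arguer = Callable[[Question, int], str]        # hypothetical debater/consultant
Judge = Callable[[Question, List[Speech]], int]  # hypothetical judge


def debate_round(q: Question, debater: Arguer, judge: Judge) -> int:
    """Self-play debate: the same model argues for every candidate answer."""
    speeches = [(i, debater(q, i)) for i in range(len(q.answers))]
    return judge(q, speeches)


def consultancy_round(q: Question, consultant: Arguer, judge: Judge,
                      assigned: int) -> int:
    """Consultancy: one model argues its assigned answer with no opponent."""
    return judge(q, [(assigned, consultant(q, assigned))])


def judge_accuracy(picks: Sequence[int], questions: Sequence[Question]) -> float:
    """Fraction of questions on which the judge picked the correct answer."""
    if not questions:
        raise ValueError("need at least one question")
    return sum(p == q.correct for p, q in zip(picks, questions)) / len(questions)


def toy_arguer(q: Question, side: int) -> str:
    # Assumption for illustration: only the correct side can cite evidence.
    evidence = " [quote]" if side == q.correct else ""
    return f"The answer is {q.answers[side]}.{evidence}"


def toy_judge(q: Question, speeches: List[Speech]) -> int:
    # Picks the first side that cites evidence, else defers to the first speaker.
    for side, text in speeches:
        if "[quote]" in text:
            return side
    return speeches[0][0]


questions = [
    Question("Who wrote the letter?", ("Ann", "Bob"), correct=1),
    Question("Where was it sent?", ("Paris", "Rome"), correct=0),
]
debate_picks = [debate_round(q, toy_arguer, toy_judge) for q in questions]
consult_picks = [consultancy_round(q, toy_arguer, toy_judge, assigned=0)
                 for q in questions]
```

In this sketch the quantity of interest is `judge_accuracy(picks, questions)`, computed separately for each protocol; replacing the stubs with real models and a real judge would be needed before any comparison is meaningful.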
Taxonomy
Topics: Artificial Intelligence in Law
