Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar

TL;DR
This paper introduces a meta-rewarding mechanism for LLMs that enables self-improvement of both response quality and judgment skills, reducing reliance on human data and enhancing instruction-following capabilities.
Contribution
The paper proposes a novel meta-rewarding step allowing LLMs to judge and improve their own judgments, leading to better performance without human supervision.
Findings
Improved win rate of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2.
Enhanced judgment and instruction-following abilities.
Demonstrated potential for unsupervised self-improvement in LLMs.
Abstract
Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to…
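The step described in the abstract, where the model judges its own responses and then judges those judgments, can be pictured with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function meta_rewarding_step, the three callables generate, judge, and meta_judge (each standing in for a prompt to the same model), the numeric score attached to each judgment, and the choice to emit (better, worse) preference pairs for a later training step.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical representation: a judgment is (written rationale, numeric score).
Judgment = Tuple[str, float]


def meta_rewarding_step(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    judge: Callable[[str, str, int], List[Judgment]],
    meta_judge: Callable[[str, str, Judgment, Judgment], int],
    n_responses: int = 4,
    n_judgments: int = 2,
):
    """Build response-level and judgment-level preference pairs for one prompt.

    All three callables are placeholders for calls to the same model playing
    different roles; none of them is a real library API.
    """
    # 1. The model answers the prompt several times.
    #    (Assumes the sampled responses are distinct strings.)
    responses = generate(prompt, n_responses)

    # 2. The model judges each of its own responses several times.
    judgments = {r: judge(prompt, r, n_judgments) for r in responses}

    # 3. Response pairs: highest vs. lowest mean judge score.
    avg = {r: mean(score for _, score in js) for r, js in judgments.items()}
    chosen = max(responses, key=avg.get)
    rejected = min(responses, key=avg.get)
    actor_pairs = [(prompt, chosen, rejected)] if avg[chosen] > avg[rejected] else []

    # 4. Judgment pairs: the model compares two of its own judgments of the
    #    same response and says which is better (0 = first, 1 = second).
    judge_pairs = []
    for r in responses:
        for a, b in combinations(judgments[r], 2):
            winner = meta_judge(prompt, r, a, b)
            better, worse = (a, b) if winner == 0 else (b, a)
            judge_pairs.append((prompt, r, better, worse))

    return actor_pairs, judge_pairs
```

In this sketch the first list would be used to improve how the model responds and the second to improve how it judges; how the pairs are actually consumed during training is left open here.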
Taxonomy
Topics: Artificial Intelligence in Law
