Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Tianhao Wu; Weizhe Yuan; Olga Golovneva; Jing Xu; Yuandong Tian; Jiantao Jiao; Jason Weston; Sainbayar Sukhbaatar

arXiv:2407.19594·cs.CL·July 31, 2024·2 cites

TL;DR

This paper introduces a meta-rewarding mechanism for LLMs that enables self-improvement of both response quality and judgment skills, reducing reliance on human data and enhancing instruction-following capabilities.

Contribution

The paper proposes a novel meta-rewarding step allowing LLMs to judge and improve their own judgments, leading to better performance without human supervision.

Findings

01 Improved the win rate of Llama-3-8B-Instruct on AlpacaEval 2 from 22.9% to 39.4%.

02 Enhanced both the model's judgment ability and its instruction-following ability.

03 Demonstrated the potential for LLMs to self-improve without human supervision.

Abstract

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to…
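The loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function `meta_rewarding_round`, the callables `generate`, `judge`, and `meta_judge`, the choice of best-versus-worst response pairs, and the pairwise comparison of judgments. In practice all three callables would be the same LLM under different prompts; the stubs below only make the sketch runnable.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Judgment = Tuple[str, float]  # (judgment text, numeric score); hypothetical format


def meta_rewarding_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],        # hypothetical: prompt, n -> responses
    judge: Callable[[str, str], Judgment],            # hypothetical: prompt, response -> judgment
    meta_judge: Callable[[str, str, str, str], int],  # hypothetical: returns 0 or 1 (better judgment)
    n_responses: int = 4,
    n_judgments: int = 2,
):
    """Collect preference pairs for the response role and for the judge role."""
    actor_pairs = []  # (prompt, chosen response, rejected response)
    judge_pairs = []  # (prompt, response, chosen judgment, rejected judgment)
    for prompt in prompts:
        scored = []
        for response in generate(prompt, n_responses):
            # The model judges its own response several times.
            judgments = [judge(prompt, response) for _ in range(n_judgments)]
            mean_score = sum(score for _, score in judgments) / len(judgments)
            scored.append((mean_score, response))
            # Meta-rewarding step: the model judges its own judgments.
            for (text_a, _), (text_b, _) in combinations(judgments, 2):
                winner = meta_judge(prompt, response, text_a, text_b)
                chosen, rejected = (text_a, text_b) if winner == 0 else (text_b, text_a)
                judge_pairs.append((prompt, response, chosen, rejected))
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] > worst[0]:  # skip prompts where all responses tie
            actor_pairs.append((prompt, best[1], worst[1]))
    return actor_pairs, judge_pairs


# Deterministic stand-ins for the model, used only to exercise the sketch.
def stub_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt}-r{i}" for i in range(n)]


def stub_judge(prompt: str, response: str) -> Judgment:
    score = float(response[-1])  # toy score taken from the response suffix
    return (f"score {score} for {response}", score)


def stub_meta_judge(prompt: str, response: str, judgment_a: str, judgment_b: str) -> int:
    return 0  # toy rule: always prefer the first judgment


actor_pairs, judge_pairs = meta_rewarding_round(
    ["p"], stub_generate, stub_judge, stub_meta_judge, n_responses=3
)
```

Both lists of pairs would then presumably feed a preference-based training step for the next iteration; that step is not shown here, and the abstract as excerpted does not specify it.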

Taxonomy

Topics: Artificial Intelligence in Law