Improve LLM-as-a-Judge Ability as a General Ability
Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, Xuelong Li

TL;DR
This paper presents a two-stage training approach that improves the ability of large language models to act as general judges, making their evaluation of responses across diverse scenarios more accurate and more data-efficient, with state-of-the-art results on RewardBench.
Contribution
The work introduces a novel two-stage training method combining supervised fine-tuning and preference optimization, along with an efficient data synthesis technique, to improve LLMs' judging capabilities with less data.
Findings
Achieves state-of-the-art performance on RewardBench.
Requires only 2% to 40% of the data that other methods use.
Enhances downstream policy optimization through improved judge signals.
Abstract
LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values, ensuring ethical and reliable AI outputs that align with societal norms. Recent studies have proposed many methods to train LLMs as generative judges, but most of them are data-intensive or lack accuracy, and they focus only on the LLM's judging ability. In this work, we regard judging ability as a general ability of LLMs and implement a two-stage training approach, comprising a supervised fine-tuning (SFT) warm-up and a direct preference optimization (DPO) enhancement, to achieve judge-style adaptation and improve judgment accuracy. Additionally, we introduce an efficient data synthesis method to generate judgmental content. Experimental results…
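To make the second stage concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the authors' implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are assumed to be summed sequence log-probabilities of a chosen and a rejected judgment under the policy being trained and under a frozen reference model (for example, the SFT warm-up checkpoint).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments except beta are summed sequence log-probabilities.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy prefers the chosen judgment more than the reference does.
    print(dpo_loss(-12.0, -20.0, -15.0, -18.0))
```

The loss falls below log 2 when the policy ranks the chosen judgment above the rejected one by a wider margin than the reference does, and rises above log 2 in the opposite case, which is what pushes the judge toward more accurate verdicts.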
Taxonomy
Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law · Dispute Resolution and Class Actions
Methods: Direct Preference Optimization · ALIGN · Focus
