J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Austin Xu; Yilun Zhou; Xuan-Phi Nguyen; Caiming Xiong; Shafiq Joty

arXiv:2505.13346·cs.CL·June 19, 2025

TL;DR

This paper introduces EIS-GRPO, a reinforcement learning algorithm for training language model judges that are more robust in reasoning tasks, along with a new benchmark to evaluate their performance.

Contribution

The paper presents three things: EIS-GRPO, a novel RL algorithm for training judges; ReasoningJudgeBench, a new benchmark; and a 7B judge model that outperforms larger models in reasoning evaluations.

Findings

01. J4R outperforms GPT-4o and other small judges in reasoning tasks.

02. EIS-GRPO reduces positional biases in judge training.

03. The judge trained with EIS-GRPO matches or exceeds larger models on evaluation benchmarks.

Abstract

To keep pace with the rapid development of large language models (LLMs), model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce…
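The abstract is truncated here and does not spell out how EIS-GRPO computes its updates, so the following is only an illustrative sketch of one way the idea could work, not the authors' implementation. It assumes that the "equivalent initial states" are the two presentation orders of the same response pair (A then B, and B then A), and that judge rollouts from both orders are pooled into a single group before the usual GRPO-style normalization of rewards. The function names `group_relative_advantages` and `pooled_advantages`, and the 0/1 reward values, are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def pooled_advantages(rewards_ab, rewards_ba):
    """Assumed EIS-style grouping: treat both response orders as one state.

    rewards_ab: rewards for rollouts that saw the pair in order (A, B).
    rewards_ba: rewards for rollouts that saw the same pair in order (B, A).
    Returns the advantages split back per order.
    """
    pooled = list(rewards_ab) + list(rewards_ba)
    adv = group_relative_advantages(pooled)
    n = len(rewards_ab)
    return adv[:n], adv[n:]


# Hypothetical position-biased judge: always correct in one order (reward 1),
# always wrong in the swapped order (reward 0).
rewards_ab = [1.0, 1.0, 1.0, 1.0]
rewards_ba = [0.0, 0.0, 0.0, 0.0]

# Separate groups per order: every reward equals its group mean, so all
# advantages are zero and the inconsistency produces no learning signal.
separate_ab = group_relative_advantages(rewards_ab)
separate_ba = group_relative_advantages(rewards_ba)

# Pooled group: the swapped-order failures now sit below the shared mean.
pooled_ab, pooled_ba = pooled_advantages(rewards_ab, rewards_ba)
```

Under these assumptions, normalizing each order separately hides the positional inconsistency, while pooling gives the failing order negative advantages and the succeeding order positive ones.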

Code & Models

Datasets

Salesforce/ReasoningJudgeBench
dataset · 58 downloads

Taxonomy

Topics: Artificial Intelligence in Law