J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha

TL;DR
This paper introduces J1, a reinforcement learning framework that trains large language models to improve their evaluation capabilities by teaching them to think before judging, leading to state-of-the-art performance on multiple benchmarks.
Contribution
The paper presents a novel RL-based method for training LLM judges to think before judging, converting tasks into a unified format with verifiable rewards, and demonstrates superior performance with synthetic data training.
Findings
J1 models achieve state-of-the-art results on multiple benchmarks.
J1, trained on synthetic data, outperforms larger models trained on real data.
Qualitative analysis shows J1 develops systematic evaluation strategies.
Abstract
The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitasked pointwise and pairwise judge, also outperforms o1-mini, o3,…
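The verifiable-reward idea in the abstract can be illustrated with a minimal sketch. It assumes a pairwise setting in which each training pair has a known better response, and it assumes the consistency reward pays out only when the judge is correct under both presentation orders. The class and function names below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairwiseJudgment:
    """One judge rollout for a response pair shown in a particular order.

    Hypothetical container: both fields are display positions ("A" or "B").
    """
    verdict: str          # position the judge preferred
    chosen_position: str  # position where the known-better response was shown


def correctness_reward(judgment: PairwiseJudgment) -> float:
    """Return 1.0 if the verdict picks the known-better response, else 0.0."""
    return 1.0 if judgment.verdict == judgment.chosen_position else 0.0


def consistency_reward(original: PairwiseJudgment,
                       swapped: PairwiseJudgment) -> float:
    """Return 1.0 only if the judge is correct in BOTH orderings (assumption)."""
    both_correct = (correctness_reward(original) == 1.0
                    and correctness_reward(swapped) == 1.0)
    return 1.0 if both_correct else 0.0


# A judge that always answers "A" is right in one ordering only.
biased_original = PairwiseJudgment(verdict="A", chosen_position="A")
biased_swapped = PairwiseJudgment(verdict="A", chosen_position="B")
biased_total = consistency_reward(biased_original, biased_swapped)

# A judge that tracks the better response is right in both orderings.
robust_original = PairwiseJudgment(verdict="A", chosen_position="A")
robust_swapped = PairwiseJudgment(verdict="B", chosen_position="B")
robust_total = consistency_reward(robust_original, robust_swapped)
```

Under these assumptions, a judge that always prefers the first position earns the correctness reward in one ordering but never the consistency reward, which is how an order-aware reward can discourage positional bias.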
Peer Reviews
Decision·ICLR 2026 Poster
The paper introduces a unified framework that converts both verifiable and non-verifiable evaluation tasks into verifiable formats using synthetic preference pairs; applies online RL to directly optimize the chain-of-thought reasoning in LLM judges; proposes a novel consistency-based reward that enforces the same judgment regardless of response order; and develops a multitask J1 model that jointly learns from both pairwise and pointwise supervision. Although similar ideas have been employed in ot
1. The current setup focuses solely on pairwise and pointwise evaluation, without exploring extensions to multi-response or listwise judgment.
2. The work defines both Verdict Correctness and Verdict Consistency rewards, but lacks any reward weighting or sensitivity analysis.
3. The data used for training and evaluation primarily covers conversational and reasoning domains, with no experiments on diverse areas such as code or multimodal judgment.
4. This paper omits any discussion of training cost
1. Strong Empirical Results: The model trained with RL shows strong and consistent improvements across benchmarks, and is able to match frontier models (e.g. o3-mini, DeepSeek-R1) that are an order of magnitude bigger.
2. Comprehensive Ablation & Analysis: The authors provide a thorough analysis of factors such as positional bias in the ablation study, which helps explain the model's behavior, and show that the consistency reward reduces the "Verdict Flip/Ties" rate.
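To make the positional-bias measurement concrete, here is a minimal sketch of a verdict flip rate. It assumes each pair is judged twice, once per ordering, and counts a flip whenever the preferred underlying response changes between orderings. The function names and the labels "x" and "y" are hypothetical, tie handling is omitted, and this summary does not give the paper's exact metric definition.

```python
from typing import List, Tuple


def preferred_response(verdict: str, swapped: bool) -> str:
    """Map a position verdict ("A" or "B") to the underlying response.

    In the original order, position A holds response "x" and B holds "y".
    In the swapped order, position A holds "y" and B holds "x".
    """
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if not swapped:
        return "x" if verdict == "A" else "y"
    return "y" if verdict == "A" else "x"


def flip_rate(verdict_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred response changes with the ordering.

    Each tuple is (verdict in original order, verdict in swapped order).
    """
    if not verdict_pairs:
        raise ValueError("need at least one judged pair")
    flips = sum(
        1
        for original, swapped in verdict_pairs
        if preferred_response(original, False) != preferred_response(swapped, True)
    )
    return flips / len(verdict_pairs)


# Two consistent pairs and two pairs where the judge sticks to a position.
example = [("A", "B"), ("A", "A"), ("B", "A"), ("B", "B")]
rate = flip_rate(example)
```

In this example the judge sticks to the same position in two of four pairs, so half of the verdicts flip when the order is swapped.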
1. Training Complexity: While J1 shows stronger performance than EvalPlanner, which uses offline DPO training, the comparison is not fully apples-to-apples. It is unclear how the two methods (GRPO vs. DPO) differ under the same compute budget.
- Unified, verifiable training recipe across “verifiable” and “non-verifiable” prompts enabling direct optimization via RL, not just DPO/SFT.
- Bias mitigation at training time via batching both orderings and an order-consistency reward; clean and effective.
- SOTA or competitive results on multiple judging benchmarks at reasonable model sizes, with notably strong PPE Correctness and RewardBench numbers; comprehensive comparison tables.
- Thoughtful ablations (rewards, prompts; pairwise vs point
- “Non-verifiable” to verifiable conversion is underspecified. For subjective tasks, the paper leans on synthetic construction and pairwise labels; the validity of these labels as ground truth for reward is not rigorously validated with humans.
- Limited causal analysis of bias reduction. While order-consistency improves, the paper doesn’t isolate the effect of batched dual-ordering vs consistency reward vs prompt phrasing with confidence intervals on all benchmarks.
- A minor detail is that whi
Taxonomy
Topics: Law, Economics, and Judicial Systems · Corporate Insolvency and Governance
