J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge
Chi-Min Chan, Chunpu Xu, Jiaming Ji, Zhen Ye, Pengcheng Wen, Chunyang Jiang, Yaodong Yang, Wei Xue, Sirui Han, Yike Guo

TL;DR
This paper introduces J1-7B, a fine-tuned large language model that uses simple test-time scaling to improve evaluation performance and interpretability in LLM-as-a-Judge systems; the scaling benefit appears mainly after reinforcement learning training.
Contribution
It proposes a novel approach combining supervised fine-tuning on reflection datasets, reinforcement learning, and test-time scaling to improve LLM evaluation methods and scaling behavior.
Findings
J1-7B surpasses the previous state-of-the-art LLM-as-a-Judge by 4.8%.
A significant test-time scaling trend appears mainly after RL training.
Simple test-time scaling enhances performance and interpretability.
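This summary does not say how the test-time scaling is implemented. One simple form described in the test-time scaling literature is to ignore the model's first stopping point and append a cue such as "Wait" so that it keeps reasoning before committing to a verdict. The sketch below assumes that form and may differ from what the paper does. Everything named in it (`generate`, `judge_with_extended_thinking`, `extra_rounds`, `nudge`, and the "Final verdict:" marker) is hypothetical and stands in for whatever model interface and prompt format are actually used.

```python
from typing import Callable

# Hypothetical model interface: takes the text so far and returns a continuation.
Generate = Callable[[str], str]

VERDICT_MARKER = "\nFinal verdict:"  # assumed marker, not from the paper


def judge_with_extended_thinking(
    prompt: str,
    generate: Generate,
    extra_rounds: int = 2,
    nudge: str = "Wait",
) -> str:
    """Return the reasoning trace followed by the judge's final verdict."""
    # First pass: the model reasons until it would normally stop.
    trace = generate(prompt)

    # Spend extra test-time compute: append a nudge and let the model continue.
    for _ in range(extra_rounds):
        trace += "\n" + nudge + ", "
        trace += generate(prompt + trace)

    # Only now ask for the verdict, conditioned on the full trace.
    verdict = generate(prompt + trace + VERDICT_MARKER)
    return trace + VERDICT_MARKER + verdict
```

Raising `extra_rounds` increases the number of reasoning tokens spent per judgment, which is the quantity a test-time scaling curve would be plotted against. Returning the trace along with the verdict keeps the reasoning available for inspection.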
Abstract
The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, leaving users often uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI
