TL;DR
CompassJudger-2 is a generalist judge model trained with verifiable rewards. It achieves strong results across multiple judge and reward benchmarks, and its 7B variant is competitive with significantly larger models.
Contribution
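The paper introduces CompassJudger-2, a generalist judge model built on task-driven, multi-domain data curation and trained with verifiable rewards, rejection sampling, and a margin policy gradient loss. It also introduces JudgerBenchV2, a benchmark for cross-domain judgment evaluation.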
Findings
CompassJudger-2 achieves superior results across multiple judge and reward benchmarks.
The 7B model is competitive in judgment accuracy with significantly larger models such as DeepSeek-V3 and Qwen3-235B-A22B.
The paper proposes JudgerBenchV2 for standardized evaluation of judge models.
Abstract
Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like…
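The abstract's idea of supervising judgments with verifiable rewards through rejection sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(reasoning, verdict)` output format, and the toy sampler are all hypothetical, and the sketch assumes each training prompt comes with a known correct verdict that a sampled judgment can be checked against.

```python
import random
from typing import Callable, List, Tuple

# A sampled judgment: (free-form critical reasoning, final verdict label).
Judgment = Tuple[str, str]


def rejection_sample(
    prompt: str,
    gold_verdict: str,
    sample_judgment: Callable[[str], Judgment],
    num_samples: int = 8,
) -> List[Judgment]:
    """Draw several judgments and keep only those whose final verdict
    matches the verifiable label (hypothetical helper, for illustration)."""
    kept: List[Judgment] = []
    for _ in range(num_samples):
        reasoning, verdict = sample_judgment(prompt)
        if verdict == gold_verdict:  # the verifiable check
            kept.append((reasoning, verdict))
    return kept


def make_toy_judge(seed: int) -> Callable[[str], Judgment]:
    """Stand-in for a judge model: picks 'A' or 'B' at random."""
    rng = random.Random(seed)

    def judge(prompt: str) -> Judgment:
        verdict = rng.choice(["A", "B"])
        return (f"Response {verdict} better satisfies the instruction.", verdict)

    return judge


if __name__ == "__main__":
    accepted = rejection_sample(
        "Which response is better, A or B?",
        gold_verdict="A",
        sample_judgment=make_toy_judge(0),
        num_samples=16,
    )
    print(f"kept {len(accepted)} of 16 sampled judgments")
```

In a real pipeline the accepted judgments would serve as training targets, so the model learns reasoning traces that end in a verifiably correct verdict. How the paper combines these samples with its margin policy gradient loss is not specified in the abstract and is not modeled here.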
