TL;DR
This paper introduces TIR-Judge, an RL framework that enhances LLM judges with tool-integrated reasoning, enabling more accurate evaluations across diverse domains and outperforming existing models on multiple benchmarks.
Contribution
The paper presents a novel RL-based training method for LLM judges that incorporates a code executor, improving their ability to verify complex constraints without relying on distillation.
Findings
TIR-Judge outperforms strong reasoning-based judges by up to 7.7%.
TIR-Judge-Zero matches the performance of distilled variants without using distilled data.
Listwise performance of TIR-Judge is comparable to larger models like Claude-Opus-4.
Abstract
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise),…
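The idea of a judge that verifies a constraint by running code, rather than estimating it from text alone, can be illustrated with a toy sketch. This is not the paper's implementation: the names `run_python`, `word_count`, `pointwise`, `pairwise`, and `listwise` are hypothetical, the word-limit constraint is an assumed example, and the code snippet a trained judge would write itself is hard-coded here. The executor uses an unsandboxed in-process `exec`, which is only acceptable for a demonstration.

```python
import contextlib
import io
from typing import List


def run_python(code: str) -> str:
    """Toy code executor: run a snippet and return what it printed.

    Assumption: a real system would sandbox this; plain exec() is unsafe.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def word_count(response: str) -> int:
    """Count words by executing a snippet, standing in for a judge's tool call."""
    # Hypothetical snippet; a trained judge would generate this during reasoning.
    code = f"print(len({response!r}.split()))"
    return int(run_python(code))


def pointwise(response: str, max_words: int) -> bool:
    """Pointwise format: does a single response satisfy the word limit?"""
    return word_count(response) <= max_words


def pairwise(response_a: str, response_b: str, max_words: int) -> str:
    """Pairwise format: prefer the response that satisfies the word limit."""
    ok_a = pointwise(response_a, max_words)
    ok_b = pointwise(response_b, max_words)
    if ok_a == ok_b:
        return "tie"
    return "A" if ok_a else "B"


def listwise(responses: List[str], max_words: int) -> List[int]:
    """Listwise format: indices of all responses that satisfy the word limit."""
    return [i for i, r in enumerate(responses) if pointwise(r, max_words)]
```

For instance, `pairwise("a b c", "a b c d e f", 4)` prefers the first response because only it stays within four words. The three functions share one verification routine, which mirrors (in a very reduced form) how a single tool-using judge could serve all three judgment formats.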
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well-written, and the motivation and approach are clear. The figures and case study enhance readability.
2. The proposed method framework is effective, and the evaluation is comprehensive, spanning a variety of baselines and several relevant benchmarks.
3. The flexible judgement formats (e.g., pointwise, pairwise, and listwise) expand the functionality of the framework.
1. Prior work has identified several biases common in LLM-based judges, e.g., positional bias [1], verbosity bias [2], and self-preference bias [3]. Other work [4] has shown that training can increase the prevalence of such biases. Given this, this work should evaluate for bias before and after training.
2. It's unclear how the baseline tool-augmented judges (e.g., Qwen3-4B-Tool or Gemini-2.5-Flash-Tool) are evaluated. Are they expected to output tool calls in the same format as TIR-Judge, e.g.,
1. Novel integration of tool-use with reinforcement learning: The work goes beyond inference-time tool invocation by embedding code execution into the RL loop, allowing models to learn when and how to use tools effectively.
2. Strong empirical performance: Demonstrated improvements across multiple datasets and formats, outperforming larger models like 32B RRM on several metrics. Particularly impressive is the performance of TIR-Judge-Zero, which learns without teacher supervision.
1. Domain bias: Gains are largest in verifiable domains (math, code), while improvements in non-verifiable areas (helpfulness, safety) are marginal.
2. Generalization limitations: The framework is Python-specific; it is unclear how well the method would extend to multiple heterogeneous tools or symbolic engines.
1. Clear and practical problem definition: The practical limitations of LLM evaluators (inaccurate calculation, failure to verify constraints) are clearly presented with concrete examples. The importance of evaluators throughout the model development pipeline (post-training, inference, and evaluation) is convincingly described.
2. Comprehensive experimental design: The authors systematically evaluate the method on six different benchmarks. They demonstrate generalizability by addressing all three evalu
1. Problem: The key components of this paper (tool-integrated reasoning, RL-based judges, and rejection sampling) are all techniques covered in previous studies. Combining them is meaningful, but the paper lacks theoretical analysis or deeper insight into why this particular combination is effective. Evidence: Feng et al. (2025) and Li et al. (2025a), cited in Related Work, have already trained TIR with RL, while Chen et al. (2025b) and Whitehouse et al. (2025) have dealt with RL-based ju
