Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Arduin Findeis; Floris Weers; Guoli Yin; Ke Ye; Ruoming Pang; Tom Gunter

arXiv:2507.17015·cs.CL·July 24, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates whether external validation tools, such as web search and code execution, can improve the quality of AI-generated annotations used to evaluate large language models, especially on factual, math, and coding tasks.

Contribution

It introduces a tool-using agentic system that leverages external validation to improve annotation quality in challenging response domains for LLM evaluation.

Findings

01

External tools improve annotation quality in many cases.

02

Performance is sensitive to prompt parameters.

03

Better, non-saturated benchmarks are needed.

Abstract

Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system…
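The setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `ToolAugmentedAnnotator`, the toy functions, and the "Compute: x op y" prompt format are all hypothetical stand-ins. The sketch assumes a design in which the annotator classifies the input domain, consults a domain-specific tool check when one exists, and otherwise falls back to a baseline annotator (the reviews below mention both domain-based tool selection and a fallback to existing annotations).

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A judge sees (prompt, response_a, response_b) and returns "a" or "b",
# or None when it cannot reach a verdict.
Judge = Callable[[str, str, str], Optional[str]]


@dataclass
class ToolAugmentedAnnotator:
    """Hypothetical pairwise annotator: route to a tool judge, else fall back."""
    classify_domain: Callable[[str], str]  # e.g. returns "math" or "other"
    tool_judges: Dict[str, Judge]          # domain name -> tool-backed judge
    baseline_judge: Judge                  # stands in for a plain LLM annotator

    def annotate(self, prompt: str, response_a: str, response_b: str) -> str:
        judge = self.tool_judges.get(self.classify_domain(prompt))
        if judge is not None:
            verdict = judge(prompt, response_a, response_b)
            if verdict in ("a", "b"):
                return verdict
        # No tool applies, or the tool was inconclusive: use the baseline.
        verdict = self.baseline_judge(prompt, response_a, response_b)
        if verdict not in ("a", "b"):
            raise ValueError("baseline judge must return 'a' or 'b'")
        return verdict


def toy_classify(prompt: str) -> str:
    # Toy rule (assumption): any digit in the prompt means a math task.
    return "math" if any(ch.isdigit() for ch in prompt) else "other"


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def toy_math_judge(prompt: str, a: str, b: str) -> Optional[str]:
    # Stand-in for code execution: recompute "Compute: x op y" and check
    # which response states the recomputed value.
    try:
        left, op, right = prompt.split(":", 1)[1].split()
        expected = str(OPS[op](int(left), int(right)))
    except (IndexError, KeyError, ValueError):
        return None
    a_ok, b_ok = expected in a, expected in b
    if a_ok != b_ok:
        return "a" if a_ok else "b"
    return None  # both or neither match: inconclusive


def toy_baseline_judge(prompt: str, a: str, b: str) -> Optional[str]:
    # Deliberately biased toward longer answers, mimicking an annotator
    # that weighs writing quality over correctness.
    return "a" if len(a) >= len(b) else "b"


annotator = ToolAugmentedAnnotator(
    classify_domain=toy_classify,
    tool_judges={"math": toy_math_judge},
    baseline_judge=toy_baseline_judge,
)

short_correct = "The answer is 391."
long_wrong = "It is clearly 381, as the following elegant reasoning shows."
choice = annotator.annotate("Compute: 17 * 23", short_correct, long_wrong)
```

In this toy case the length-biased baseline alone would pick the longer, incorrect response, while the tool check selects the correct one; for prompts with no matching tool the annotator simply returns the baseline's verdict.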

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

1. While the use of tools in AI-based applications is fairly commonplace now, their use in annotation systems is an interesting and novel idea, and the paper demonstrates fairly well that it works for at least a few domains.
2. The paper is well written and presents fair experimental backing for its claims.
3. The paper introduces three novel datasets for evaluating the domain-specific annotation capabilities of language models.

Weaknesses

1. While the use of tooling for AI annotators is interesting, in the current iteration of the work it is not clear whether it will scale to more custom tooling. In the agent evaluator discussed in the paper, even though it defaults to existing annotations for the no-tool use cases, the system shows a degradation in performance on RewardBench, the only OOD dataset evaluated. This makes me concerned about the generalizability of the system.
2. Two of the proposed benchmarks don't have baseli

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. The paper is clearly written.
2. Classifying the input domain and selecting tools accordingly makes sense.
3. Substantial improvements on certain subsets, particularly APPS.

Weaknesses

1. My main concern is novelty. Several highly related published papers on tool-augmented AI feedback have not been cited or clearly discussed ([1][2]). "Novel framework" sounds like an overclaim.
2. Studying pairwise feedback in domains with clear objective correctness (e.g., fact, code, math) is unjustified.
3. Mixed results. Performance slightly decreases on general domains (RewardBench) and on math when the base model is stronger (e.g., GPT-4o).
[1] https://arxiv.org/pdf/2310.01045 [2] https://o

Reviewer 03 · Rating 8 · Confidence 4

Strengths

This paper proposes a reasonable and interesting framework for improving pairwise judgements using automated annotators. Recent work has shown the strength of strong, automated pairwise annotators, and this work is a valuable extension of that, showing that ground truth information in the responses (that traditional LLM-only systems might not always pick up on) is valuable for making these decisions.

Weaknesses

While this paper shows strong results on annotation accuracy, it is unclear how well this improves downstream performance. I don't think this is a hard requirement for this work, but I'd be interested to see how model performance changes using this method to either generate preference data, or do best-of-n ranking for model outputs. I do not think this is required for this paper to be accepted, however.
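The best-of-n ranking that this reviewer suggests could be built on top of any pairwise annotator. The following is a minimal sketch under stated assumptions, not something evaluated in the paper: `best_of_n` and `toy_judge` are hypothetical names, the judge returns "a" or "b", and its preferences are assumed to be roughly transitive so that a single pass of n - 1 comparisons is meaningful.

```python
from typing import Callable, Sequence

# Hypothetical pairwise judge: returns "a" if the first response is
# preferred and "b" if the second is.
PairwiseJudge = Callable[[str, str, str], str]


def best_of_n(prompt: str, candidates: Sequence[str], judge: PairwiseJudge) -> str:
    """Keep a running winner and compare it against each remaining candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        # The current winner is always passed as response "a".
        if judge(prompt, best, challenger) == "b":
            best = challenger
    return best


def toy_judge(prompt: str, a: str, b: str) -> str:
    # Toy stand-in for a tool-backed judge: prefer the response that
    # states the known-correct value 391 (17 * 23).
    return "b" if "391" in b and "391" not in a else "a"


samples = ["381", "It is 391.", "400"]
winner = best_of_n("Compute: 17 * 23", samples, toy_judge)
```

Because the running winner is always shown first, a judge with position bias could skew the result; randomizing or swapping the order per comparison is one possible mitigation.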

Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law