Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Austin Xu; Xuan-Phi Nguyen; Yilun Zhou; Chien-Sheng Wu; Caiming Xiong; Shafiq Joty

arXiv:2510.17793·cs.CL·November 20, 2025

TL;DR

This paper introduces FARE, a large-scale data-driven approach to training reasoning-centric evaluators for language models, demonstrating superior performance over existing specialized evaluators in multiple real-world tasks.

Contribution

The paper presents a new data scaling methodology and the FARE family of large, open-source evaluators trained with simple supervised finetuning, setting new standards in reasoning evaluation.

Findings

1. FARE-20B surpasses 70B+ specialized evaluators on benchmarks.
2. FARE-20B achieves near-oracle performance as an inference-time reranker.
3. Fine-tuned FARE improves RL training outcomes by up to 14.1%.
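To make the "near-oracle reranker" finding concrete, the sketch below shows how best-of-N reranking is commonly scored against an oracle. It is an illustration under stated assumptions, not the paper's evaluation code: the candidate fields (`score`, `correct`), the function names, and the toy data are all hypothetical. The reranker picks the candidate the evaluator scores highest; the oracle counts a problem as solved if any candidate is correct, so it upper-bounds what any reranker can reach.

```python
from typing import Dict, List, Tuple

Candidate = Dict[str, object]  # hypothetical schema: {"answer", "score", "correct"}


def rerank_best_of_n(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate with the highest evaluator score (first one wins ties)."""
    return max(candidates, key=lambda c: c["score"])


def reranker_vs_oracle(problems: List[List[Candidate]]) -> Tuple[float, float]:
    """Return (reranker accuracy, oracle accuracy) over a list of problems."""
    picked = 0
    oracle = 0
    for candidates in problems:
        # Reranker: correct only if the top-scored candidate is correct.
        picked += int(bool(rerank_best_of_n(candidates)["correct"]))
        # Oracle: correct if any sampled candidate is correct.
        oracle += int(any(bool(c["correct"]) for c in candidates))
    n = len(problems)
    return picked / n, oracle / n


# Toy data (made up for illustration): three problems, two candidates each.
toy_problems = [
    [{"answer": "4", "score": 0.9, "correct": True},
     {"answer": "5", "score": 0.2, "correct": False}],
    [{"answer": "7", "score": 0.8, "correct": False},
     {"answer": "8", "score": 0.6, "correct": True}],
    [{"answer": "1", "score": 0.3, "correct": False},
     {"answer": "2", "score": 0.1, "correct": False}],
]

reranker_acc, oracle_acc = reranker_vs_oracle(toy_problems)
```

In the toy data the evaluator misranks the second problem, so the reranker solves one of three problems while the oracle solves two; a "near-oracle" evaluator would close most of that gap.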

Abstract

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and…
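The abstract names iterative rejection-sampling SFT but this excerpt does not spell out the procedure. The sketch below shows the generic rejection-sampling filter such approaches usually build on, and should be read as an assumption rather than the authors' exact recipe: sample several judgments per labeled example, keep only those whose verdict matches the ground-truth label, and use the kept samples as SFT data. The names `sample_judgment`, `rejection_sample`, and `build_sft_round`, the verdict strings, and the toy sampler are all hypothetical.

```python
import itertools
from typing import Callable, Dict, List, Tuple

# A sampler maps a prompt to (reasoning text, verdict); in practice this
# would be the current evaluator model. Here it is a stand-in callable.
Sampler = Callable[[str], Tuple[str, str]]


def rejection_sample(prompt: str, label: str, sample_judgment: Sampler,
                     k: int = 4) -> List[Dict[str, str]]:
    """Draw k judgments and keep those whose verdict agrees with the label."""
    kept = []
    for _ in range(k):
        reasoning, verdict = sample_judgment(prompt)
        if verdict == label:  # reject samples that reach the wrong verdict
            kept.append({"prompt": prompt, "response": reasoning,
                         "verdict": verdict})
    return kept


def build_sft_round(dataset: List[Tuple[str, str]], sample_judgment: Sampler,
                    k: int = 4) -> List[Dict[str, str]]:
    """Collect one round of SFT data over (prompt, label) pairs."""
    sft_data: List[Dict[str, str]] = []
    for prompt, label in dataset:
        sft_data.extend(rejection_sample(prompt, label, sample_judgment, k))
    return sft_data


# Toy sampler (assumption): alternates verdicts deterministically.
_verdicts = itertools.cycle(["correct", "incorrect"])


def toy_sampler(prompt: str) -> Tuple[str, str]:
    verdict = next(_verdicts)
    return f"toy reasoning ending in '{verdict}'", verdict


toy_dataset = [("Q: 2+2? Response: 4", "correct"),
               ("Q: 2+2? Response: 5", "incorrect")]
sft_data = build_sft_round(toy_dataset, toy_sampler, k=4)
```

An iterative variant would finetune the evaluator on `sft_data`, then repeat the sampling step with the updated model; that training step is omitted here because it depends on details not given in this excerpt.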

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* Well-executed recipe without any complicated tuning or parameterization
* If the model is released, it will be especially useful
* Multi-tasking across different eval formats was shown to work

Weaknesses

* Regarding the last strength, it would be great to see an ablation of multi-tasking and how it affects performance on specific eval tasks. From my understanding, there is no such experiment in the paper.
* In the comparison with competing evaluator models from the literature, it is a bit unclear what the other models' initialization checkpoints were, i.e., whether they are Llama or Qwen. Such a difference may make the comparison unfair if we care about the added value of the proposed data and multi-tasking. A…

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The 2.5M-sample multi-task dataset is one of the largest curated for evaluation, spanning reasoning, code, math, and tool-use, enabling strong generalization across domains.
2. The use of RS-SFT (rejection sampling SFT) achieves performance comparable to RL-based methods while being computationally more efficient and stable.
3. Evaluations on 7 benchmarks and 3 real-world tasks (e.g., MATH reranking, RL training, code evaluation) show strong improvements over state-of-the-art baselines.
4. Th…

Weaknesses

1. Novelty is rather low. The authors themselves mention that the method is simple and is a minor modification of methods like STE and RAFT. The paper emphasizes empirical scaling but provides little theoretical analysis of why RS-SFT works better for evaluators compared to RL or DPO approaches.
2. While reasoning-centric, the dataset and evaluations largely focus on math, code, and tool-use; broader language understanding or multimodal evaluations are underexplored.
3. Though large-scale, the p…

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The paper addresses a timely and important problem of building multi-task evaluators that can handle diverse evaluation scenarios, which is increasingly critical as LLMs become integrated into various applications.
- The data curation strategy is thorough and well-designed, combining 1.4M existing samples with 1.1M synthetic samples using both programmatic error injection and generate-then-grade approaches across multiple domains.
- The training methodology using iterative rejection sampling S…

Weaknesses

- The cold-start initialization procedure for Qwen3-8B-Base using Qwen2.5-32B-Instruct data is not well-justified, and the authors acknowledge this produces a weaker baseline than the post-trained Qwen3-8B (Table 9), raising questions about whether better initialization could further improve results.
- The paper lacks detailed analysis of failure modes or systematic error analysis that would help understand when and why FARE models struggle, particularly on the benchmarks where performance lags…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning and Data Classification