Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

TL;DR
This paper introduces FARE, a large-scale data-driven approach to training reasoning-centric evaluators for language models, demonstrating superior performance over existing specialized evaluators in multiple real-world tasks.
Contribution
The paper presents a data-scaling methodology and FARE, a family of open-source 8B and 20B evaluators trained with simple iterative rejection-sampling supervised finetuning (SFT) rather than RL.
Findings
FARE-20B surpasses specialized 70B+ evaluators on evaluation benchmarks.
FARE-20B achieves near-oracle performance as an inference-time reranker.
Fine-tuned FARE improves RL training outcomes by up to 14.1%.
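The reranking finding refers to best-of-N selection: an evaluator scores several candidate responses and the highest-scoring one is returned, while an oracle reranker returns a correct candidate whenever one exists. The comparison can be sketched in Python as below. This is an illustration only: `score_fn` and `is_correct` are hypothetical stand-ins for an evaluator's rating and a ground-truth checker, not FARE's actual interface.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],  # hypothetical evaluator score
) -> str:
    """Return the candidate the evaluator scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score_fn(question, c))


def oracle_pick(
    candidates: Sequence[str],
    is_correct: Callable[[str], bool],  # hypothetical ground-truth checker
) -> str:
    """Upper bound: return a correct candidate if any exists, else the first."""
    for c in candidates:
        if is_correct(c):
            return c
    return candidates[0]


# Toy usage with a stand-in scorer that prefers answers ending in "4".
question = "What is 2 + 2?"
candidates = ["The answer is 5", "The answer is 4", "The answer is 3"]
toy_score = lambda q, c: 1.0 if c.endswith("4") else 0.0

picked = rerank_best_of_n(question, candidates, toy_score)
oracle = oracle_pick(candidates, lambda c: c.endswith("4"))
```

A reranker is "near-oracle" when, over a test set, `picked` is correct almost as often as `oracle` is.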
Abstract
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and…
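The abstract names an iterative rejection-sampling SFT approach without giving its details here. A generic rejection-sampling round can be sketched as follows: sample several evaluator outputs per training prompt, keep only an output whose final verdict matches the ground-truth label, and use the kept pairs as SFT data for the next round. Everything in this sketch is an assumption made for illustration, including the names `EvalSample`, `generate_fn`, and `extract_verdict`, the sample count `k`, and the keep-one-per-prompt rule. It is not the authors' exact recipe.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalSample:
    prompt: str  # evaluation prompt (e.g. a pairwise comparison)
    label: str   # ground-truth verdict (e.g. "A" or "B")


def rejection_sample_round(
    samples: List[EvalSample],
    generate_fn: Callable[[str], str],      # hypothetical: samples one evaluator output
    extract_verdict: Callable[[str], str],  # hypothetical: parses the final verdict
    k: int = 4,                             # assumed number of attempts per prompt
) -> List[Tuple[str, str]]:
    """Collect (prompt, output) pairs whose verdict agrees with the label."""
    kept = []
    for sample in samples:
        for _ in range(k):
            output = generate_fn(sample.prompt)
            if extract_verdict(output) == sample.label:
                kept.append((sample.prompt, output))
                break  # assumption: keep at most one accepted output per prompt
    return kept


# Toy usage with a scripted "model" that emits a fixed sequence of outputs.
scripted = iter(["Verdict: B", "Verdict: A", "Verdict: B", "Verdict: B"])
data = [EvalSample("p1", "A"), EvalSample("p2", "A")]
sft_pairs = rejection_sample_round(
    data,
    generate_fn=lambda prompt: next(scripted),
    extract_verdict=lambda out: out.split(": ")[1],
    k=2,
)
```

In an iterative setup, the model finetuned on `sft_pairs` would serve as `generate_fn` in the next round. Prompts with no accepted output (here `p2`) contribute no training pair in that round.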
Peer Reviews
Decision: ICLR 2026 Poster
* Well executed recipe without any perplexed tuning or parameterization
* If the model will be released, then it will be especially useful
* Multi-tasking across different eval formats was shown to work
* W.r.t. the last strength point, it would be great to see an ablation of multi-tasking and how it affects performance on specific eval tasks. From my understanding, there is no such experiment in the paper.
* When compared with competing evaluator models from the literature, it's a bit unclear what the other models' initialization checkpoints were, i.e., whether Llama or Qwen. Such a difference may make the comparison unfair if we care about the added value of the proposed data and multitasking. A
1. The 2.5M-sample multi-task dataset is one of the largest curated for evaluation, spanning reasoning, code, math, and tool-use, enabling strong generalization across domains.
2. The use of RS-SFT (rejection sampling SFT) achieves performance comparable to RL-based methods while being computationally more efficient and stable.
3. Evaluations on 7 benchmarks and 3 real-world tasks (e.g., MATH reranking, RL training, code evaluation) show strong improvements over state-of-the-art baselines.
4. Th
1. Novelty is rather low. The authors themselves mention that the method is simple and is a minor modification of methods like STE and RAFT. The paper emphasizes empirical scaling but provides little theoretical analysis of why RS-SFT works better for evaluators compared to RL or DPO approaches.
2. While reasoning-centric, the dataset and evaluations largely focus on math, code, and tool-use; broader language understanding or multimodal evaluations are underexplored.
3. Though large-scale, the p
- The paper addresses a timely and important problem of building multi-task evaluators that can handle diverse evaluation scenarios, which is increasingly critical as LLMs become integrated into various applications.
- The data curation strategy is thorough and well-designed, combining 1.4M existing samples with 1.1M synthetic samples using both programmatic error injection and generate-then-grade approaches across multiple domains.
- The training methodology using iterative rejection sampling S
- The cold-start initialization procedure for Qwen3-8B-Base using Qwen2.5-32B-Instruct data is not well-justified, and the authors acknowledge this produces a weaker baseline than the post-trained Qwen3-8B (Table 9), raising questions about whether better initialization could further improve results.
- The paper lacks detailed analysis of failure modes or systematic error analysis that would help understand when and why FARE models struggle, particularly on the benchmarks where performance lags
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Machine Learning and Data Classification
