Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Tong Chen; Akari Asai; Luke Zettlemoyer; Hannaneh Hajishirzi; Faeze Brahman

arXiv:2510.17733·cs.CL·October 21, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces a binary retrieval-augmented reward for training language models with reinforcement learning. By rewarding only fully factual outputs, the method reduces hallucinations while maintaining performance on instruction following, math, and code.

Contribution

The paper presents a novel binary reward scheme for reinforcement learning that effectively mitigates hallucinations without degrading open-ended generation or downstream task performance.

Findings

1. 39.3% reduction in hallucination rates
2. 44.4% fewer incorrect answers on PopQA
3. No performance degradation on instruction following, math, or code

Abstract

Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention,…
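The contrast the abstract draws between a binary and a continuous reward can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the function names `binary_reward` and `continuous_reward` are hypothetical, and the input is assumed to be a list of per-claim verdicts that some retrieval-backed verifier has already produced. How those verdicts are obtained is out of scope here.

```python
from typing import Sequence


def binary_reward(verdicts: Sequence[bool]) -> float:
    """Return 1.0 only if every checked claim is correct, else 0.0.

    Assumption of this sketch: an output with no checked claims counts
    as error-free. That choice is not taken from the paper.
    """
    return 1.0 if all(verdicts) else 0.0


def continuous_reward(verdicts: Sequence[bool]) -> float:
    """Return the fraction of checked claims that are correct.

    Assumption of this sketch: an empty verdict list scores 1.0, to
    stay consistent with binary_reward above.
    """
    if not verdicts:
        return 1.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for one response: three correct claims, one error.
mostly_right = [True, True, False, True]
fully_right = [True, True, True]

# A single error zeroes the binary reward but only dents the continuous one.
print(binary_reward(mostly_right), continuous_reward(mostly_right))
print(binary_reward(fully_right), continuous_reward(fully_right))
```

Under these assumptions, a response with one wrong claim out of four receives partial credit from the continuous scheme and none from the binary scheme, which is the distinction the abstract emphasizes.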

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* The writing is clear, and the proposed method is straightforward.
* The experiments are thorough, and the work is relatively comprehensive.

Weaknesses

* The main difference between this work and the most relevant baseline, VeriScore, lies in the design of the reward signal. While VeriScore uses a soft reward based on the proportion of correct claims, this paper adopts a hard binary reward that simply judges whether any factual inconsistency exists. Apart from this modification, there are no substantial differences between the two approaches. Overall, this incremental improvement makes the work resemble more of a technical report than a full re

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The method is simple and effective, easy to implement from an engineering perspective, and demonstrates greater robustness compared to dense reward models.
2. The experimental design simultaneously addresses both hallucination mitigation and preservation of general capabilities.
3. Experimental results show that the proposed approach improves the model’s factual accuracy, achieving strong overall performance.

Weaknesses

1. The authors claim that full-text contradiction detection avoids the “error accumulation” of claim-wise verification, but this statement lacks justification. Claim-level verification is an independent process without cumulative error, whereas full-text inputs may introduce contextual interference and order bias. A controlled comparison between the two verification granularities is recommended to support this claim.
2. The “I don’t know” samples are not human-labeled but automatically detected

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper replaces traditional continuous factuality scores with a binary retrieval-augmented reward, explicitly optimizing the model toward *avoiding factual errors*. This design leads to a simple yet effective mechanism for factual correction and hallucination reduction.
2. The experiments are well-structured and comprehensive, including ablation studies, multi-task and multi-model evaluations, and assessments of general capabilities. The results appear credible and consistent.
3. The paper

Weaknesses

1. The optimization objective fully depends on retrieved evidence: the binary reward is determined by whether the retrieved documents contradict the model output. Since retrieval systems inherently contain bias (e.g., source preference, recency errors, and incomplete coverage), the model is effectively optimized to align with retrieval consensus rather than the ground truth. In essence, the model learns retrieval alignment, not truth alignment.
2. This RLKF paper (https://arxiv.org/abs/2403.183

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning