Learning to Reason for Factuality
Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas Oğuz, Rulin Shao, Gargi Ghosh, Jason Weston, Wen-tau Yih

TL;DR
This paper introduces a new online reinforcement learning approach with a novel reward function to improve the factual accuracy of reasoning large language models, significantly reducing hallucinations without losing response quality.
Contribution
It presents a novel reward function for online RL that balances factuality, detail, and relevance, enhancing LLM factual reasoning capabilities.
Findings
23.1 percentage point reduction in hallucination rate
23% increase in answer detail level
no degradation in response helpfulness
Abstract
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and…
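To make the idea of a combined reward concrete, here is a minimal sketch of how factual precision, detail level, and relevance might be folded into one scalar. It is not the paper's formula, which the truncated abstract does not state. The function name `factuality_reward`, the log-scaled detail term, the binary relevance signal, and both weights are hypothetical choices made only for illustration.

```python
import math


def factuality_reward(
    num_supported: int,
    num_claims: int,
    beats_reference: bool,
    detail_weight: float = 0.1,      # hypothetical weight, not from the paper
    relevance_weight: float = 0.5,   # hypothetical weight, not from the paper
) -> float:
    """Illustrative scalar reward; every term here is an assumption.

    num_supported   -- claims a verifier judged as supported
    num_claims      -- all claims extracted from the response
    beats_reference -- whether a judge preferred this response over a
                       reference response on relevance (assumed signal)
    """
    if num_supported < 0 or num_claims < 0 or num_supported > num_claims:
        raise ValueError("need 0 <= num_supported <= num_claims")

    # Factual precision: fraction of extracted claims that are supported.
    precision = num_supported / num_claims if num_claims else 0.0

    # Detail term: grows with the number of supported claims, with
    # diminishing returns so that sheer length does not dominate.
    detail = math.log1p(num_supported)

    # Relevance term: a binary bonus from the assumed judge comparison.
    relevance = 1.0 if beats_reference else 0.0

    return precision + detail_weight * detail + relevance_weight * relevance


if __name__ == "__main__":
    short_answer = factuality_reward(2, 2, beats_reference=False)
    detailed_answer = factuality_reward(18, 20, beats_reference=True)
    print(f"short: {short_answer:.3f}  detailed: {detailed_answer:.3f}")
```

Under precision alone, the two-claim answer (2 of 2 supported) would outscore the twenty-claim answer (18 of 20 supported), which is the kind of reward hacking toward less detailed responses that the abstract describes. With the assumed detail and relevance terms added, the more detailed and more relevant answer scores higher.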
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper tackles an important issue: factuality in reasoning LLMs, extending the RL-for-reasoning paradigm beyond purely verifiable domains. 2. The authors systematically diagnose failure modes (over-precision → short answers; spurious detail → irrelevant verbosity) and explicitly design against them.
1. The overlap with concurrent factual-RL papers (Li & Ng (2025) and Ren et al. (2025)) could be better delineated; distinguishing features beyond “long-form” need clearer articulation. 2. The reward is empirically motivated but lacks formal justification or convergence discussion, e.g., how the multi-objective reward interacts with GRPO stability. 3. Given that the core claim is reduction in hallucination, human-verified factuality on a subset would strengthen the argument considerably.
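As background for the GRPO stability question in point 2, GRPO-style training converts the scalar rewards of a group of responses sampled for the same prompt into advantages by normalizing within the group. The following is a minimal sketch of that normalization only, assuming population standard deviation and a small `eps` for numerical safety. The function name and the choice of `eps` are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population standard deviation is an assumption;
    implementations differ on this detail.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses to one prompt, scored by a combined reward.
    print(group_relative_advantages([0.4, 0.9, 1.3, 0.9]))
    # Identical rewards give zero advantages, so the group carries no signal.
    print(group_relative_advantages([0.7, 0.7, 0.7]))
```

Because the advantages depend only on how rewards are spread within a group, the relative scale of the terms in a multi-objective reward determines which objective drives the update. That is one reason the interaction the reviewer asks about is worth examining.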
Overall, I appreciate the research question and motivation of this paper, which tackles an important and timely problem with clear significance. The authors' observation that state-of-the-art reasoning models (DeepSeek-R1, QwQ-32B) exhibit significantly higher hallucination rates than their non-reasoning counterparts (10-13 percentage points worse on average, Table 1) is both surprising and concerning. This finding challenges the implicit assumption that "more reasoning = better quality" and highlights
1. The motivation of the paper is appealing in that it aims to address the issue that previous methods for long-form factuality evaluation have not considered the relevance between the question and the corresponding answer. However, the implementation is somewhat disappointing: it merely compares whether the optimized model's responses are better than those of the base model. But what if the base model's answer is itself irrelevant to the question? This approach does not directly solve the state
- The paper clearly identifies the challenge of long-form factuality in reasoning LLMs and motivates the need for online RL training. - The authors implement an efficient online version of VeriScore, reducing verification time to a few seconds per response. - The experiments cover six factuality benchmarks, showing consistent gains in both factual precision and supported facts.
- The answer relevance reward depends on another LLM’s judgment, which may introduce bias from the judge model. - The results may also be sensitive to the choice of the reward LLM, yet the paper does not clearly specify which model or size was used as the judge. - The authors mention that FActScore leads to less detailed answers, but the paper provides limited explanation of how detail level is precisely measured or how the model avoids generating irrelevant but correct statements. - Because bo
