CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

Yu Liu; Wenxiao Zhang; Diandian Guo; Cong Cao; Fangfang Yuan; Qiang Sun; Yanbing Liu; Jin B. Hong; Zhiyuan Ma

arXiv:2602.01348·cs.CL·March 17, 2026

TL;DR

CRAFT is a reinforcement learning framework that enhances multi-hop question answering by producing structured, answer-faithful reasoning traces, improving accuracy and reasoning quality even under noisy retrieval conditions.

Contribution

CRAFT introduces an RL-based approach that trains models to generate structured, auditable reasoning traces with configurable transparency levels.

Findings

01. CRAFT improves answer accuracy across model scales.
02. Semantic judge-based rewards enhance reasoning faithfulness.
03. CRAFT achieves performance competitive with strong closed-source models.

Abstract

Retrieval-augmented large language models, when optimized with outcome-level rewards, can achieve strong answer accuracy on multi-hop questions. However, under noisy retrieval, models frequently suffer from "right-answer-wrong-reason failures": they may exploit spurious shortcuts or produce reasoning traces weakly grounded in the supporting evidence. Furthermore, the lack of structured output control prevents reliable auditing of the underlying reasoning quality. To address this, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a reinforcement learning framework for the response generation stage of retrieval-augmented multi-hop question answering. CRAFT trains models to produce structured reasoning traces with configurable levels of auditability (e.g., by selectively retaining planning, evidence citation, or reasoning steps). Training combines two complementary forms…
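
To make the idea of a structured trace with configurable auditability concrete, here is a minimal sketch of a format check. It is an illustration only, not the paper's implementation: the tag names (`plan`, `evidence`, `reasoning`, `answer`), the `TraceConfig` class, and the `follows_format` function are all assumptions, since the abstract names the components but not their markup.

```python
import re
from dataclasses import dataclass

# Hypothetical section tags, in the order a trace is assumed to present them.
SECTION_ORDER = ("plan", "evidence", "reasoning", "answer")

# Matches one tagged section; the backreference forces matching open/close tags.
SECTION_PATTERN = re.compile(
    r"<(plan|evidence|reasoning|answer)>(.*?)</\1>", re.DOTALL
)


@dataclass(frozen=True)
class TraceConfig:
    """Which optional sections a trace must retain (assumed knobs)."""
    keep_plan: bool = True
    keep_evidence: bool = True
    keep_reasoning: bool = True

    def required_sections(self):
        flags = {
            "plan": self.keep_plan,
            "evidence": self.keep_evidence,
            "reasoning": self.keep_reasoning,
            "answer": True,  # the final answer is always required
        }
        return [name for name in SECTION_ORDER if flags[name]]


def parse_trace(text):
    """Return (tag, body) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in SECTION_PATTERN.finditer(text)]


def follows_format(text, config):
    """True if the trace has exactly the required sections, in order, all non-empty."""
    sections = parse_trace(text)
    tags = [tag for tag, _ in sections]
    return tags == config.required_sections() and all(body for _, body in sections)


if __name__ == "__main__":
    trace = (
        "<plan>Find the director, then the director's birthplace.</plan>"
        "<evidence>[2] names the director; [5] gives the birthplace.</evidence>"
        "<reasoning>Combine [2] and [5].</reasoning>"
        "<answer>Paris</answer>"
    )
    print(follows_format(trace, TraceConfig()))
    print(follows_format(trace, TraceConfig(keep_plan=False)))
```

A check like this only verifies that the output has the expected shape. It says nothing about whether the cited evidence actually supports the answer, which is the faithfulness question the abstract raises.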

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Multimodal Machine Learning Applications