SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

Daeyong Kwon; Soyoung Yoon; Seung-won Hwang

arXiv:2604.01993 · cs.CL · April 3, 2026


TL;DR

SAFE is a verification framework for multi-hop QA that replaces ungrounded Chain-of-Thought reasoning with a verifiable sequence of grounded entities, improving accuracy and exposing unanswerable instances in existing benchmarks.

Contribution

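The paper proposes a two-phase verification approach: train-time verification, which uses an atomic error taxonomy and a Knowledge Graph (KG)-grounded pipeline to filter noisy supervision, and inference-time verification, in which a feedback model trained on the verified data flags ungrounded reasoning steps.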

Findings

01. Identified up to 14% of instances in standard benchmarks as unanswerable.

02. Achieved an 8.4 percentage-point accuracy improvement over baselines.

03. Demonstrated the effectiveness of verifiable reasoning trajectories.

Abstract

Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train time, but also…
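To make the idea of a "verifiable sequence of grounded entities" concrete, here is a minimal sketch of stepwise checking against a knowledge graph. It is an illustration under stated assumptions, not the authors' pipeline. The triple representation, the toy KG, the function names `verify_chain` and `first_ungrounded`, and the two reason strings are all hypothetical; in particular, the reason strings are not the paper's atomic error taxonomy, and the exact-match lookup stands in for whatever KG grounding and trained feedback model SAFE actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical representation: one reasoning hop is a (head, relation, tail) triple.
Triple = Tuple[str, str, str]


@dataclass
class StepFeedback:
    index: int      # position of the hop in the reasoning chain
    step: Triple    # the hop that was checked
    grounded: bool  # True if the hop passed both checks below
    reason: str     # illustrative label only, NOT the paper's taxonomy


def verify_chain(chain: List[Triple], kg: Set[Triple]) -> List[StepFeedback]:
    """Check each hop: it must exist in the KG and start where the previous hop ended."""
    feedback: List[StepFeedback] = []
    prev_tail: Optional[str] = None
    for i, step in enumerate(chain):
        head, _relation, tail = step
        if step not in kg:
            feedback.append(StepFeedback(i, step, False, "triple not found in KG"))
        elif prev_tail is not None and head != prev_tail:
            feedback.append(StepFeedback(i, step, False, "hop does not continue from previous entity"))
        else:
            feedback.append(StepFeedback(i, step, True, "ok"))
        prev_tail = tail
    return feedback


def first_ungrounded(feedback: List[StepFeedback]) -> Optional[int]:
    """Return the index of the first failing hop, or None if every hop is grounded."""
    return next((f.index for f in feedback if not f.grounded), None)


# Toy KG, assumed purely for illustration.
TOY_KG: Set[Triple] = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("London", "capital_of", "United Kingdom"),
}

if __name__ == "__main__":
    chain = [
        ("Inception", "directed_by", "Christopher Nolan"),
        ("Christopher Nolan", "born_in", "Paris"),  # not supported by the toy KG
    ]
    for item in verify_chain(chain, TOY_KG):
        print(item)
```

Reporting feedback per hop, rather than a single pass/fail verdict on the final answer, is what lets a chain with a correct-looking answer but an unsupported intermediate step be flagged. This is the kind of spurious correctness the abstract describes.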
