SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

TL;DR
SAFE introduces a verification framework for multi-hop QA that ensures reasoning steps are grounded and verifiable, improving accuracy and exposing benchmark flaws.
Contribution
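It proposes a two-phase verification approach: train-time verification, built on an atomic error taxonomy and a Knowledge Graph (KG)-grounded pipeline, and inference-time verification, in which a feedback model trained on the verified data flags ungrounded reasoning steps.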
Findings
Identified up to 14% of instances in standard benchmarks as unanswerable.
Achieved an 8.4 percentage point accuracy improvement over baselines.
Demonstrated the effectiveness of verifiable reasoning trajectories.
Abstract
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also…
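As a rough illustration of what step-level grounding could look like, the sketch below checks a multi-hop reasoning chain against a toy knowledge graph and reports the first step that fails. It is not the authors' pipeline: the triple format, the `TOY_KG` contents, the `first_ungrounded_step` function, and the rule that each hop must start from the entity reached by the previous hop are all assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical step format: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]

# Toy knowledge graph, invented for illustration only.
TOY_KG: Set[Triple] = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
}


def first_ungrounded_step(chain: List[Triple], kg: Set[Triple]) -> Optional[int]:
    """Return the index of the first step that fails verification, or None.

    A step fails if its triple is absent from the KG, or if it does not
    start from the entity reached by the previous step (assumed rule).
    """
    previous_tail: Optional[str] = None
    for index, (head, relation, tail) in enumerate(chain):
        # Grounding check: the claimed fact must exist in the KG.
        if (head, relation, tail) not in kg:
            return index
        # Continuity check: each hop must continue from the previous entity.
        if previous_tail is not None and head != previous_tail:
            return index
        previous_tail = tail
    return None


grounded_chain = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]
flawed_chain = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "Paris"),  # not in the toy KG
]

print(first_ungrounded_step(grounded_chain, TOY_KG))
print(first_ungrounded_step(flawed_chain, TOY_KG))
```

Returning the index of the failing step, rather than a single pass/fail flag, is what would let a feedback signal point at the specific hop that needs correction.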
