Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
Yunhan Bu, Quan Zhang, Huaping Zhang, Guotong Geng, Chunxiao Gao, Askar Hamdulla, Juan Wang, Qiuchi Li, Baohua Zhang, Shuai Lei, Yunbo Cao, Zhunchen Luo

TL;DR
This paper introduces a Structural Causal Model-based framework with Group Relative Policy Optimization to improve multi-hop fact verification accuracy and interpretability in large language models.
Contribution
It proposes grounding reasoning in a Structural Causal Model (SCM), together with a GRPO-based reinforcement learning method that balances reasoning complexity against accuracy in fact verification.
Findings
Identifies an inverted U-shaped relationship between reasoning chain length and accuracy.
Demonstrates significant performance improvements over state-of-the-art baselines on the HoVer and EX-FEVER datasets.
Abstract
Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs), which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off…
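To illustrate how a rule-based reward can be paired with GRPO, here is a minimal sketch. It shows the group-relative advantage used in GRPO generally: sample several reasoning chains for the same claim, score each one, then normalize every score by the group's mean and standard deviation. The reward function `rule_based_reward`, its step cap `max_steps`, and `length_penalty` are hypothetical. The abstract does not specify the paper's actual reward rules. A penalty on chains longer than a cap is only a crude stand-in for discouraging the excessive complexity on the far side of the inverted U.

```python
from statistics import mean, pstdev


def rule_based_reward(predicted_label, gold_label, num_steps,
                      max_steps=6, length_penalty=0.1):
    """Hypothetical rule-based reward (an assumption, not the paper's rule):
    1.0 for a correct verdict, 0.0 otherwise, minus a fixed penalty for
    every reasoning step beyond an assumed cap of `max_steps`."""
    correctness = 1.0 if predicted_label == gold_label else 0.0
    excess_steps = max(0, num_steps - max_steps)
    return correctness - length_penalty * excess_steps


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each sampled chain's reward is centered on the
    group mean and scaled by the group standard deviation, so no learned
    value function (critic) is needed. `eps` avoids division by zero when
    every chain in the group receives the same reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One claim, a group of four sampled chains: (predicted label, step count).
    gold = "SUPPORTED"
    samples = [("SUPPORTED", 3), ("SUPPORTED", 9), ("REFUTED", 4), ("SUPPORTED", 5)]
    rewards = [rule_based_reward(label, gold, steps) for label, steps in samples]
    advantages = group_relative_advantages(rewards)
    for (label, steps), r, a in zip(samples, rewards, advantages):
        print(f"{label:<9} steps={steps}  reward={r:.2f}  advantage={a:+.2f}")
```

In this toy group, the two short correct chains receive the largest advantage. The correct but overlong chain is pushed below them, and the wrong verdict gets the most negative advantage. A policy-gradient update weighted by these advantages would therefore favor chains that are both accurate and compact.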
