DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

Xinyi Wang; Yiping Song; Zhiliang Tian; Bo Liu; Tingjin Luo; Minlie Huang

arXiv:2511.08364 · cs.CL · December 2, 2025

TL;DR

DPRM introduces a dual implicit reward modeling approach for multi-hop question answering. It guides the reasoning process by jointly optimizing Chain of Thought (CoT) and Knowledge Graph (KG) paths, without requiring extensive annotations.

Contribution

The paper proposes DPRM, a dual implicit process reward model that jointly trains two implicit reward models over the reasoning steps in multi-hop QA, subject to a consistency constraint.

Findings

01 Outperforms 13 baselines on multiple datasets

02 Achieves up to 16.6% improvement on Hit@1

03 Effectively guides multi-step reasoning without explicit annotations
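
For readers unfamiliar with the Hit@1 metric cited in the findings, here is a minimal sketch of how it is commonly computed in KG question answering: the share of questions whose top-ranked prediction appears in the gold answer set. The function name `hit_at_1` and the input layout are assumptions made for illustration; the paper's own evaluation script may differ in details such as answer normalization.

```python
from typing import Sequence, Set


def hit_at_1(
    predictions: Sequence[Sequence[str]],
    gold_answers: Sequence[Set[str]],
) -> float:
    """Fraction of questions whose top-ranked prediction is a gold answer.

    predictions[i] is a ranked list of candidate answers for question i
    (best first); gold_answers[i] is the set of accepted answers.
    Both the name and the layout are hypothetical, for illustration only.
    """
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must align")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(predictions, gold_answers)
        if ranked and ranked[0] in gold  # an empty prediction counts as a miss
    )
    return hits / len(predictions)
```

Only the first candidate is checked, so a correct answer ranked second does not count as a hit.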

Abstract

In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after the final answer is generated but fail to evaluate the multi-step reasoning process. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. An implicit PRM, by contrast, is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, which makes it more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRMs have only been explored in plain-text scenarios. When adapted to MHQA tasks, they cannot handle the graph structure constraints in KGs and capture…
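
The "reward parameterization" mentioned in the abstract can be illustrated with a small sketch. It assumes the formulation commonly used in prior implicit PRM work on plain text, where the reward is a scaled log-likelihood ratio between a trained model and a frozen reference model, and a step's reward is that ratio summed over the step's tokens. This is not DPRM's dual CoT/KG model; the function name, the `beta` default, and the input layout are hypothetical.

```python
from typing import List, Sequence


def implicit_step_rewards(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    step_ends: Sequence[int],
    beta: float = 0.05,
) -> List[float]:
    """Step rewards under an assumed log-ratio reward parameterization.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the same
    response under the trained model and the frozen reference model.
    step_ends: exclusive end index of each reasoning step, in increasing order.
    beta: scaling coefficient (illustrative value, not taken from the paper).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    rewards: List[float] = []
    start = 0
    for end in step_ends:
        if not start < end <= len(policy_logprobs):
            raise ValueError("step_ends must be increasing and within range")
        # Reward of a step = beta * sum of token log-ratios inside the step.
        log_ratio = sum(
            p - r
            for p, r in zip(policy_logprobs[start:end], ref_logprobs[start:end])
        )
        rewards.append(beta * log_ratio)
        start = end
    return rewards
```

When the steps cover the whole response, the step rewards sum to the sequence-level log-ratio reward, which is why outcome labels alone can supervise training while per-step scores are read off afterwards.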

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications