PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

Yu Liu; Wenxiao Zhang; Cong Cao; Wenxuan Lu; Fangfang Yuan; Diandian Guo; Kun Peng; Qiang Sun; Kaiyan Zhang; Yanbing Liu; Jin B. Hong; Bowen Zhou; Zhiyuan Ma

arXiv:2601.05465 · cs.AI · January 12, 2026

Open Access

TL;DR

PRISMA introduces a decoupled reinforcement learning framework with a multi-agent architecture for open-domain multi-hop question answering, improving reasoning, retrieval, and stability over previous methods.

Contribution

The paper proposes PRISMA, a multi-agent RL framework that uses two-stage policy optimization to improve reasoning and retrieval in multi-hop QA.

Findings

01. Achieves state-of-the-art results on ten benchmarks.

02. Effectively addresses retrieval collapse and learning instability.

03. Enables efficient deployment in real-world scenarios.

Abstract

Answering real-world open-domain multi-hop questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to optimize the retrieval-augmented reasoning process end to end, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: without reasoning-guided planning, iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers, causing downstream reasoning to collapse. 2) Learning Instability: end-to-end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark-specific heuristics, which limits transferability and stability. To address these problems, we propose PRISMA, a decoupled…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems