MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

Jiahang Lin; Kai Hu; Binghai Wang; Yuhao Zhou; Zhiheng Xi; Honglin Guo; Shichun Liu; Junzhe Wang; Shihan Dou; Enyu Zhou; Hang Yan; Zhenhua Han; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2604.13579 · cs.CL · April 16, 2026

TL;DR

MM-Doc-R1 pairs an agentic, vision-aware workflow with a new reinforcement learning algorithm, Similarity-based Policy Optimization (SPO), for long document visual question answering, and outperforms previous baselines by 10.4% on MMLongBench-Doc.

Contribution

The paper presents MM-Doc-R1, a new multi-turn RL framework with Similarity-based Policy Optimization (SPO) for better training stability and accuracy in long document VQA tasks.

Findings

01

MM-Doc-R1 outperforms previous baselines by 10.4% on MMLongBench-Doc.

02

SPO improves training results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B.

03

The integrated framework advances the state of the art in complex long-document VQA.

Abstract

Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents because they retrieve in a single pass. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), which addresses baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms such as GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which…
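
The idea of a similarity-weighted baseline can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact formula, so the cosine similarity, the clipping of negative similarities to zero, the per-trajectory embedding vectors, and the function names below are all assumptions made for illustration. For contrast, the sketch also includes a plain group-mean baseline of the kind GRPO uses (its standard-deviation normalization is omitted for brevity).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_mean_advantages(rewards):
    """Group-mean baseline: every trajectory shares one unweighted average."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def similarity_weighted_advantages(rewards, embeddings):
    """Hypothetical similarity-weighted baseline (an assumption, not the paper's formula).

    For trajectory i, the baseline is the average of all rewards in the group,
    weighted by how similar each trajectory's embedding is to trajectory i.
    """
    advantages = []
    for i, e_i in enumerate(embeddings):
        # Assumption: negative similarities are clipped so weights stay non-negative.
        weights = [max(cosine(e_i, e_j), 0.0) for e_j in embeddings]
        total = sum(weights)
        if total == 0.0:
            # Degenerate case (e.g. zero embedding): fall back to the group mean.
            baseline = sum(rewards) / len(rewards)
        else:
            baseline = sum(w * r for w, r in zip(weights, rewards)) / total
        advantages.append(rewards[i] - baseline)
    return advantages


if __name__ == "__main__":
    # Two clusters of trajectories with made-up embeddings and rewards.
    rewards = [1.0, 0.0, 1.0, 1.0]
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(group_mean_advantages(rewards))
    print(similarity_weighted_advantages(rewards, embeddings))
```

When all trajectories have identical embeddings, the weights are equal and the weighted baseline reduces to the group mean; when trajectories fall into dissimilar clusters, each one is compared mainly against its own cluster.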
