UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Jun Wang; Shuo Tan; Zelong Sun; Tiancheng Gu; Yongle Zhao; Ziyong Feng; Kaicheng Yang; Zhiwu Lu

arXiv:2604.14967 · cs.CV · April 20, 2026

TL;DR

UniDoc-RL is a reinforcement learning framework in which a Large Vision-Language Model agent refines visual evidence hierarchically, from coarse document retrieval down to region cropping, and is trained end-to-end with dense rewards.

Contribution

It introduces a hierarchical action space and dense multi-reward scheme for end-to-end training of visual retrieval and reasoning models.

Findings

1. Achieves gains of up to 17.7% over previous RL-based methods.
2. Refines visual evidence from coarse retrieval to fine-grained perception.
3. Surpasses state-of-the-art baselines on three benchmark datasets.

Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense…
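
The coarse-to-fine refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Page`, `retrieve`, `select`, and `crop` are hypothetical, the word-overlap scorer merely stands in for a real retriever, and the selection index and crop box are hard-coded here, whereas in UniDoc-RL an LVLM agent would choose them.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Page:
    page_id: str
    caption: str
    image: Image


def retrieve(query: str, corpus: List[Page], k: int) -> List[Page]:
    """Coarse step: rank pages by word overlap (assumed stand-in for a retriever)."""
    terms = set(query.lower().split())

    def score(page: Page) -> int:
        return len(terms & set(page.caption.lower().split()))

    return sorted(corpus, key=score, reverse=True)[:k]


def select(candidates: List[Page], index: int) -> Page:
    """Finer step: keep one candidate image (the agent would pick the index)."""
    return candidates[index]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Finest step: cut out a (left, top, right, bottom) region of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


corpus = [
    Page("p1", "quarterly revenue chart",
         [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]),
    Page("p2", "team photo", [[1, 1], [1, 1]]),
    Page("p3", "revenue table by region", [[2, 2, 2], [2, 2, 2]]),
]

candidates = retrieve("revenue chart", corpus, k=2)  # coarse: documents
chosen = select(candidates, 0)                       # fine: one image
region = crop(chosen.image, (1, 1, 3, 3))            # finest: one region
print([p.page_id for p in candidates], region)
```

Each step narrows the evidence passed to the next one, which is the sense in which the action space is hierarchical: the crop operates only on the image that survived selection, and selection only on pages that survived retrieval.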

Code & Models

Repositories

deepglint/UniDoc-RL (GitHub)
