TL;DR
UniDoc-RL is a reinforcement learning framework that improves visual reasoning in Large Vision-Language Models (LVLMs) by refining visual evidence hierarchically, from coarse document retrieval down to fine-grained region cropping, and by training with dense rewards.
Contribution
It introduces a hierarchical action space and dense multi-reward scheme for end-to-end training of visual retrieval and reasoning models.
Findings
Achieves improvements of up to 17.7% over previous RL-based methods.
Effectively refines visual evidence from coarse retrieval to fine-grained perception.
Surpasses state-of-the-art baselines on three benchmark datasets.
Abstract
Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense…
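The coarse-to-fine hierarchy described in the abstract (document retrieval, then image selection, then region cropping) can be illustrated with a small state machine. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `Action`, `EvidenceState`, `allowed_actions`, and `step` are hypothetical, the retriever output is passed in as a plain list of identifiers, and the gating rule (a finer action becomes available only after its coarser prerequisite) is one plausible reading of "hierarchical action space".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    """Hypothetical action set, ordered from coarse to fine."""
    RETRIEVE = "retrieve"  # coarse: fetch candidate document images
    SELECT = "select"      # finer: pick one retrieved image
    CROP = "crop"          # finest: zoom into a region of that image
    ANSWER = "answer"      # terminate the episode with a final answer


@dataclass
class EvidenceState:
    """Evidence gathered so far in one episode (illustrative only)."""
    candidates: List[str] = field(default_factory=list)
    selected: Optional[str] = None
    crop_box: Optional[Tuple[int, int, int, int]] = None
    done: bool = False


def allowed_actions(state: EvidenceState) -> List[Action]:
    """Assumed gating rule: finer actions unlock only after coarser ones."""
    if state.done:
        return []
    actions = [Action.RETRIEVE, Action.ANSWER]
    if state.candidates:
        actions.append(Action.SELECT)
    if state.selected is not None:
        actions.append(Action.CROP)
    return actions


def step(state: EvidenceState, action: Action, arg=None) -> EvidenceState:
    """Apply one action and return the next state.

    `arg` is a list of candidate ids for RETRIEVE, one candidate id for
    SELECT, and an (x0, y0, x1, y1) box for CROP. It is unused for ANSWER.
    """
    if action not in allowed_actions(state):
        raise ValueError(f"{action.value} is not available in this state")
    if action is Action.RETRIEVE:
        # A new retrieval resets the finer-grained choices.
        return EvidenceState(candidates=list(arg))
    if action is Action.SELECT:
        if arg not in state.candidates:
            raise ValueError("selected image was not retrieved")
        return EvidenceState(candidates=state.candidates, selected=arg)
    if action is Action.CROP:
        return EvidenceState(state.candidates, state.selected, tuple(arg))
    # ANSWER: keep the gathered evidence and mark the episode finished.
    return EvidenceState(state.candidates, state.selected, state.crop_box, done=True)


# Usage: one coarse-to-fine trajectory over made-up page identifiers.
state = EvidenceState()
state = step(state, Action.RETRIEVE, ["page_3.png", "page_7.png"])
state = step(state, Action.SELECT, "page_7.png")
state = step(state, Action.CROP, (10, 20, 200, 120))
state = step(state, Action.ANSWER)
```

In an RL setting, a policy would choose among `allowed_actions(state)` at each turn, and a reward would be attached to the resulting trajectory. The reward design is not sketched here because the abstract is truncated before describing it.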
