TL;DR
ALDEN is a reinforcement learning framework that enables interactive, goal-directed navigation and reasoning in long, visually complex documents, surpassing passive reading methods in accuracy and efficiency.
Contribution
ALDEN introduces a novel active navigation approach with a fetch action, a rule-based reward system, and a visual-semantic anchoring mechanism for stable training of VLMs on long documents.
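The difference between the fetch and search actions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Fetch` and `Search` classes, the `execute` function, and the toy lexical scorer that stands in for a real page retriever are hypothetical and are not taken from the ALDEN implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Fetch:
    """Hypothetical action: open a page directly by its 1-based index."""
    page: int


@dataclass
class Search:
    """Hypothetical action: retrieve pages that match a free-text query."""
    query: str


Action = Union[Fetch, Search]


def execute(action: Action, pages: List[str], top_k: int = 1) -> List[int]:
    """Return the 1-based page numbers selected by the action."""
    if isinstance(action, Fetch):
        # Fetch needs no retriever: the index either exists or it does not.
        return [action.page] if 1 <= action.page <= len(pages) else []
    # Toy stand-in for a retriever: count words on the page that appear in the query.
    terms = set(action.query.lower().split())
    scores = [sum(word in terms for word in text.lower().split()) for text in pages]
    ranked = sorted(range(len(pages)), key=lambda i: (-scores[i], i))
    return [i + 1 for i in ranked[:top_k] if scores[i] > 0]


pages = [
    "intro to the report",
    "revenue table for 2023",
    "appendix with revenue notes and revenue charts",
]
print(execute(Fetch(2), pages))                    # direct access by page index
print(execute(Search("revenue"), pages, top_k=2))  # content-based lookup
```

In this sketch, fetch suits questions that name a page explicitly, while search covers questions that only describe content.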
Findings
Achieves state-of-the-art results on five long-document benchmarks.
Effectively stabilizes training with visual-semantic anchoring.
Demonstrates improved navigation and reasoning in complex documents.
Abstract
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the…
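The abstract does not spell out the reward, but one of the reviews below lists F1 and NDCG among its components. As a rough sketch only, the standard definitions of these two metrics look as follows. The function names, whitespace tokenization, and binary relevance are assumptions, and how ALDEN weights or combines the components is not shown here.

```python
import math
from collections import Counter
from typing import List, Set


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def ndcg(ranked_pages: List[int], relevant: Set[int]) -> float:
    """NDCG with binary relevance; assumes ranked_pages has no duplicates."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, page in enumerate(ranked_pages)
        if page in relevant
    )
    ideal_hits = min(len(relevant), len(ranked_pages))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / ideal if ideal > 0 else 0.0


print(token_f1("the cat sat", "the cat"))
print(ndcg([1, 3], {3}))
```

A score like `token_f1` could serve as an answer-level signal and `ndcg` as a signal on which pages were collected, but that mapping is an assumption rather than something the source states.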
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Introducing an RL agent-style framework for long-document visual understanding (A-VRDU) is novel and of clear research significance.
- The integration of action space, reward modeling, and stabilization mechanisms forms a coherent and well-engineered system.
- The proposed page-index-based action is particularly effective for structured documents requiring explicit page referencing and sequential reasoning.
- The combination of turn-level and token-level rewards helps alleviate the sparsity an
- **Dataset issue**: The dataset is not publicly available, which limits the paper's contribution.
- **Theoretical Novelty and Assumptions** – Does the paper provide any theoretical analysis of the visual-semantic anchoring mechanism’s convergence or optimization properties? For instance, does the dual-KL regularization affect the convergence guarantees of PPO? The paper currently presents only empirical evidence; a qualitative or theoretical discussion would strengthen the contribution.
- **Baseline Coverag
The paper offers a fresh perspective by framing long-document understanding as an active navigation problem. The introduction of explicit fetch actions for page-indexed retrieval is clever and bridges structured document navigation with multimodal reasoning. The technical implementation is well-motivated, especially the cross-level reward and the VSA mechanism, which stabilizes visual representations during RL training and effectively prevents multimodal collapse.
1. Both M3DocRAG and MDocAgent originally use ColPali as their default image retriever, but Table 2 reports results with ColQwen. Moreover, ALDEN uses Qwen2.5-VL, while M3DocRAG uses Qwen2-VL, and MDocAgent combines Llama3.1 and Qwen2-VL. These backbone differences make the comparison less fair. Including ablations where all methods use the same backbone would make the improvements more convincing.
2. ALDEN is a trained RL system that also performs multi-step reasoning at test time, while baseli
- The paper shows how reinforcement learning can be used effectively to design a document understanding framework built on a multi-step reasoning process. The proposed framework achieves state-of-the-art results on standard benchmarks.
- The ablation study shows that each of the paper's three new contributions improves model performance: the fetch action, the cross-level reward function, and the visual-semantic anchoring.
- Although the fetch action is shown to improve performance, it seems to me that this is mainly due to the bias of existing datasets towards questions about specific pages (an effect that is even stronger in the results of Table 4 on the DUDE-sub dataset). In the more general and realistic DocVQA case, questions are expected to concern the content of the document rather than specific locations in it, and thus, page numbers would probably have
- Clear and Significant Research Problem: The shift from passive document understanding to an active "Agentic VRDU" paradigm is a timely, well-motivated, and valuable research direction. The paper clearly articulates the limitations of existing methods, such as the rigidity of fixed RAG workflows, providing a strong justification for an active, agent-based approach.
- Excellent Problem-Solution Mapping: The authors clearly identify three specific challenges in applying RL to A-VRDU: (1) the ins
- The fetch action seems able to index only a single page. However, in actual scenarios, answering a question often relies on multiple pages (even non-contiguous ones), and ALDEN cannot handle this situation.
- The cross-referencing of the tables seems somewhat inconsistent.
- The cross-level reward function, while effective, is complex and introduces numerous components (format reward, F1, NDCG, proximity distance, repetition penalty) and their associated hyperparameters (
