TL;DR
SCoPE VLM is a vision-language model that navigates long documents selectively and recursively, attending only to the relevant segments. This makes long-context document understanding more memory-efficient and closer to how a human reads.
Contribution
The paper presents a novel Chain of Scroll mechanism, together with a reinforcement learning approach, for modeling agentic reading behavior in vision-language models during document navigation.
Findings
Reduces memory usage compared to existing methods
Models human-like reading behaviors effectively
First framework to explicitly model agentic reading in multi-page QA
Abstract
Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy…
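As a rough illustration of what "selectively and recursively navigate" could look like in practice, the sketch below runs a scroll loop over a paged document: at each step a policy sees only the current page and either answers or scrolls by some offset. This is a toy under stated assumptions, not the paper's implementation. The names `Step`, `navigate`, and `toy_policy`, the note format, and the keyword rules are all hypothetical, and a real system would put a VLM where the rule-based policy is.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """Hypothetical action: answer now, or scroll by a signed page offset."""
    answer: Optional[str] = None
    scroll: int = 0


# A policy maps (question, current page, notes so far) to the next action.
Policy = Callable[[str, str, List[str]], Step]


def navigate(pages: List[str], question: str, policy: Policy,
             max_steps: int = 10) -> Tuple[Optional[str], List[int]]:
    """Visit one page at a time until the policy answers or steps run out."""
    idx = 0
    visited: List[int] = []
    notes: List[str] = []  # short textual memory carried between steps
    for _ in range(max_steps):
        visited.append(idx)
        step = policy(question, pages[idx], notes)
        if step.answer is not None:
            return step.answer, visited
        notes.append(f"page {idx}: nothing relevant")
        # Clamp the scroll target to the document bounds.
        idx = max(0, min(len(pages) - 1, idx + step.scroll))
    return None, visited


def toy_policy(question: str, page: str, notes: List[str]) -> Step:
    """Rule-based stand-in for a VLM (assumption, for illustration only)."""
    if page.startswith("Revenue:"):
        return Step(answer=page.split(":", 1)[1].strip())
    if "2 pages ahead" in page:
        return Step(scroll=2)  # follow a navigation hint and skip a page
    return Step(scroll=1)


pages = [
    "Contents. Financials start 2 pages ahead.",
    "Company history.",
    "Revenue: 42M",
    "Appendix.",
]
answer, visited = navigate(pages, "What is the revenue?", toy_policy)
print(answer, visited)
```

In this toy run the policy never opens the second page, which is the point of selective navigation: only one page is in view per step, so the working set does not grow with document length.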