SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Gyubeum Lim; Yemo Koo; Vijay Krishna Madisetti

arXiv:2510.21850 · cs.CV · January 28, 2026

TL;DR

SCoPE VLM introduces a selective, recursive document navigation method for vision-language models that reduces memory usage and models human-like reading behavior in long-context document understanding tasks.

Contribution

The paper presents a novel Chain of Scroll mechanism and a reinforcement learning approach for modeling agentic reading behaviors in vision-language models during document navigation.

Findings

01. Reduces memory usage compared to existing methods
02. Models human-like reading behaviors effectively
03. First framework to explicitly model agentic reading in multi-page QA

Abstract

Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy…
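
The selective, recursive navigation idea described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: pages are plain strings standing in for page images, and `navigate`, `NavState`, `toy_policy`, and the `("scroll", offset)` / `("answer", text)` action format are all hypothetical names invented here. The point it shows is that only one page is examined per step, with short notes carried forward, rather than the whole document being loaded at once.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical action format (an assumption for this sketch):
#   ("scroll", page_offset, note) or ("answer", answer_text, note)
Action = Tuple[str, object, Optional[str]]
Policy = Callable[[str, str, List[str]], Action]


@dataclass
class NavState:
    page: int = 0                                      # index of the page in view
    notes: List[str] = field(default_factory=list)     # short memory carried between steps
    visited: List[int] = field(default_factory=list)   # pages actually looked at


def navigate(pages: List[str], question: str, policy: Policy,
             max_steps: int = 8) -> Tuple[Optional[str], NavState]:
    """Look at one page per step; the policy either scrolls or answers."""
    state = NavState()
    for _ in range(max_steps):
        state.visited.append(state.page)
        # Only the current page and the notes are passed to the policy.
        kind, value, note = policy(question, pages[state.page], state.notes)
        if note:
            state.notes.append(note)
        if kind == "answer":
            return str(value), state
        # Apply the scroll offset, clamped to the document bounds.
        state.page = min(max(state.page + int(value), 0), len(pages) - 1)
    return None, state  # step budget exhausted without an answer


def toy_policy(question: str, page_text: str, notes: List[str]) -> Action:
    """Stand-in for a model: keyword match on the last word of the question."""
    keyword = question.split()[-1].rstrip("?").lower()
    if keyword in page_text.lower():
        return ("answer", page_text, None)
    return ("scroll", 1, f"no mention of {keyword!r}")


pages = [
    "Cover page",
    "Introduction to methods",
    "Results: total revenue was 42",
    "Appendix",
]
answer, state = navigate(pages, "What was the revenue?", toy_policy)
print(answer, state.visited)
```

In this toy run the loop stops at the third page and never reads the appendix. A real system would replace `toy_policy` with a vision-language model call and would likely allow larger or backward scroll offsets; those details are not given in the text above and are left open here.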
