DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Tanveer Hannan; Dimitrios Mallios; Parth Pathak; Faegheh Sardari; Thomas Seidl; Gedas Bertasius; Mohsen Fayyaz; Sunando Sengupta

arXiv:2511.11313 · cs.CV · November 24, 2025

TL;DR

DocSLM is an efficient, lightweight vision-language model designed for understanding long multimodal documents on resource-constrained devices, achieving high performance with significantly reduced memory and computational requirements.

Contribution

We introduce DocSLM, a novel small vision-language model with a hierarchical compressor and streaming mechanism, enabling scalable long-document understanding on edge devices.

Findings

1. Matches or surpasses state-of-the-art on multiple benchmarks.
2. Uses 82% fewer visual tokens and 75% fewer parameters.
3. Achieves 71% lower latency while maintaining performance.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal…
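
The abstract describes the Streaming Abstention mechanism only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each document segment yields a candidate answer together with per-token probability distributions, scores each answer by mean Shannon entropy, and abstains on segments whose score exceeds a threshold. The function names, the `max_entropy` threshold, and the rule of keeping the lowest-entropy surviving answer are all hypothetical choices made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical types: one probability distribution per generated token.
TokenDists = Sequence[Sequence[float]]


def mean_token_entropy(token_distributions: TokenDists) -> float:
    """Average Shannon entropy (in nats) across per-token distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def stream_with_abstention(
    segment_outputs: List[Tuple[str, TokenDists]],
    max_entropy: float,
) -> Optional[Tuple[str, float]]:
    """Visit segments in order and drop answers that look uncertain.

    Assumption: a segment "abstains" when its mean token entropy is above
    max_entropy. Among the remaining answers, the lowest-entropy one is kept.
    Returns None when every segment abstains.
    """
    best: Optional[Tuple[str, float]] = None
    for answer, dists in segment_outputs:
        entropy = mean_token_entropy(dists)
        if entropy > max_entropy:
            continue  # abstain on this segment
        if best is None or entropy < best[1]:
            best = (answer, entropy)
    return best


if __name__ == "__main__":
    segments = [
        ("uncertain guess", [[0.5, 0.5]]),
        ("confident answer", [[0.99, 0.01]]),
    ]
    print(stream_with_abstention(segments, max_entropy=0.3))
```

Because only one segment's outputs need to be held at a time, a loop of this shape keeps memory roughly constant as the number of segments grows, which is the property the abstract attributes to sequential processing.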

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Neural Network Applications