DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta

TL;DR
DocSLM is an efficient, lightweight vision-language model designed for understanding long multimodal documents on resource-constrained devices, achieving high performance with significantly reduced memory and computational requirements.
Contribution
We introduce DocSLM, a novel small vision-language model with a hierarchical compressor and streaming mechanism, enabling scalable long-document understanding on edge devices.
Findings
Matches or surpasses state-of-the-art on multiple benchmarks.
Uses 82% fewer visual tokens and 75% fewer parameters.
Achieves 71% lower latency while maintaining performance.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal…
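To make the Streaming Abstention idea more concrete, here is a minimal sketch of entropy-based filtering over document segments. It is not the authors' implementation: the abstract does not specify how the uncertainty calibrator is computed or how surviving answers are combined. The names `mean_token_entropy`, `stream_with_abstention`, `answer_fn`, and `max_entropy` are hypothetical, and both the use of mean per-token Shannon entropy as the confidence score and the rule of keeping the lowest-entropy answer are assumptions made for illustration.

```python
import math
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical types: each answer comes with one probability
# distribution per generated token (assumed to be exposed by the model).
TokenDists = List[List[float]]
AnswerFn = Callable[[str], Tuple[str, TokenDists]]


def mean_token_entropy(token_dists: TokenDists) -> float:
    """Average Shannon entropy (in nats) across per-token distributions."""
    if not token_dists:
        return float("inf")  # no tokens: treat as maximally uncertain
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def stream_with_abstention(
    segments: Iterable[str],
    answer_fn: AnswerFn,
    max_entropy: float,
) -> Optional[Tuple[str, float]]:
    """Process segments one at a time and keep the most confident answer.

    Segments whose answer entropy exceeds `max_entropy` are skipped
    (the model "abstains" on them). Only one segment is handled at a
    time, so memory does not grow with document length.
    Returns (answer, entropy), or None if every segment was filtered out.
    """
    best: Optional[Tuple[str, float]] = None
    for segment in segments:
        answer, token_dists = answer_fn(segment)
        entropy = mean_token_entropy(token_dists)
        if entropy > max_entropy:
            continue  # abstain on this low-confidence segment
        if best is None or entropy < best[1]:
            best = (answer, entropy)
    return best
```

In this sketch `answer_fn` stands in for a call to the model on a single segment. A fixed threshold is the simplest possible choice; a calibrated threshold, as the abstract's "uncertainty calibrator" suggests, would likely be tuned or learned on held-out data.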
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Neural Network Applications
