LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
Roy Xie, Dan Friedman, Donghan Yu, Bowen Pan, Christopher Fifty, Jang-Hyun Kim, Xianzhi Du, Zhe Gan, Vivek Rathod, Bhuwan Dhingra

TL;DR
LensVLM lets vision-language models read text rendered as compressed images and selectively expand only the relevant images, keeping accuracy comparable to the full-text upper bound at 4.3x effective compression and generalizing beyond text QA to multimodal document and code understanding.
Contribution
An inference framework and post-training recipe that enable VLMs to scan compressed images and then selectively expand only the relevant ones to their uncompressed form, improving accuracy under high compression.
Findings
Maintains accuracy comparable to the full-text upper bound at 4.3x effective compression.
Outperforms baselines up to 10.1x compression on text QA benchmarks.
Generalizes to multimodal document and code understanding tasks.
Abstract
Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression…
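The arithmetic behind the "compression knob" and "effective compression" can be illustrated with a small sketch. This is not the paper's definition or implementation: the `RenderedPage` type, both function names, the token counts, and the cost model for an expanded page (its compressed visual tokens, already spent during the scan, plus its full text-token cost) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class RenderedPage:
    """One text chunk rendered as an image (hypothetical type, illustrative fields)."""
    text_tokens: int    # tokens the chunk would cost if fed as plain text
    visual_tokens: int  # visual tokens emitted for the compressed image


def compression_ratio(pages: Sequence[RenderedPage]) -> float:
    """Text tokens divided by visual tokens when every page stays compressed."""
    total_text = sum(p.text_tokens for p in pages)
    total_visual = sum(p.visual_tokens for p in pages)
    return total_text / total_visual


def effective_compression(pages: Sequence[RenderedPage],
                          expanded: Iterable[int]) -> float:
    """Ratio after expanding the pages whose indices are in `expanded`.

    Assumed cost model (not taken from the paper): every page is scanned in
    compressed form, and each expanded page additionally costs its full
    text-token count.
    """
    expanded_ids = set(expanded)  # de-duplicate so no page is charged twice
    total_text = sum(p.text_tokens for p in pages)
    scan_cost = sum(p.visual_tokens for p in pages)
    expand_cost = sum(pages[i].text_tokens for i in expanded_ids)
    return total_text / (scan_cost + expand_cost)


# Ten pages, each worth 1000 text tokens but only 100 visual tokens.
pages = [RenderedPage(text_tokens=1000, visual_tokens=100) for _ in range(10)]
print(compression_ratio(pages))           # 10.0
print(effective_compression(pages, [0]))  # 5.0
```

Under this assumed cost model, expanding a single page out of ten halves the ratio, which shows why expansion has to be selective: the effective compression depends on how few pages the model chooses to expand.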
