VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang

TL;DR
This paper introduces VTCBench, a benchmark that evaluates how well vision-language models understand long contexts under vision-text compression, and it shows that current models struggle to capture long-range dependencies.
Contribution
It systematically assesses VLMs on long-context tasks with VTC, providing the first benchmark and insights into their capabilities and shortcomings.
Findings
Most VLMs decode rendered text well (OCR) but struggle with long-context understanding.
Models fail to capture long associations or dependencies in VTC-processed information.
VTCBench reveals significant gaps in current models' long-context reasoning abilities.
Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which…
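The token compression ratio mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is a minimal illustration, not the procedure used by VTCBench, DeepSeek-OCR, or Glyph: it assumes a hypothetical vision encoder that emits one token per merged block of square patches, and the function names, patch size, merge factor, and page dimensions are all assumptions chosen for the example.

```python
import math


def vision_token_count(width_px, height_px, patch_px=16, merge=2):
    """Estimate vision tokens for one rendered page.

    Assumption (hypothetical encoder): the image is cut into
    patch_px x patch_px patches, and each merge x merge group of
    patches becomes a single vision token.
    """
    cell = patch_px * merge  # pixels covered by one token along each side
    cols = math.ceil(width_px / cell)
    rows = math.ceil(height_px / cell)
    return cols * rows


def compression_ratio(text_tokens, pages, width_px, height_px,
                      patch_px=16, merge=2):
    """Ratio of text tokens to the vision tokens needed to show the same text."""
    vision_tokens = pages * vision_token_count(width_px, height_px,
                                               patch_px, merge)
    return text_tokens / vision_tokens


if __name__ == "__main__":
    # Illustrative numbers only: 8192 text tokens rendered onto one
    # 1024 x 1024 page under the assumed encoder settings.
    print(compression_ratio(8192, 1, 1024, 1024))
```

Under these assumed settings, a denser rendering (smaller font, more text per page) raises the ratio, which is the information-density trade-off the benchmark is designed to probe.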
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
