VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Hongbo Zhao; Meng Wang; Fei Zhu; Wenzhuo Liu; Bolin Ni; Fanhu Zeng; Gaofeng Meng; Zhaoxiang Zhang

arXiv:2512.15649·cs.CV·December 24, 2025

TL;DR

This paper introduces VTCBench, a benchmark to evaluate vision-language models' ability to understand long contexts with vision-text compression, revealing current models' limitations in long-term dependency understanding.

Contribution

The paper systematically assesses VLMs on long-context tasks under vision-text compression (VTC), providing the first benchmark for VTC along with insights into the models' capabilities and shortcomings.

Findings

01. Most VLMs decode rendered text (OCR) well but struggle with long-context understanding.

02. Models fail to capture long-range associations or dependencies in VTC-processed information.

03. VTCBench reveals significant gaps in current models' long-context reasoning abilities.

Abstract

The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which…
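To make the 3x-20x compression ratios in the abstract concrete, the arithmetic can be sketched as below. This is only an illustration: the function name `vision_tokens_needed` and the 128,000-token context length are assumptions chosen for the example, not values or code from the paper, and the sketch treats the compression ratio simply as text tokens divided by vision tokens.

```python
import math


def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Estimate the vision-token budget for a text of `text_tokens` tokens.

    Assumes (hypothetically) that compression ratio = text tokens / vision tokens,
    so each vision token stands in for `compression_ratio` text tokens.
    """
    if text_tokens < 0:
        raise ValueError("text_tokens must be non-negative")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    # Round up: a partially filled vision token still occupies a full slot.
    return math.ceil(text_tokens / compression_ratio)


if __name__ == "__main__":
    context_length = 128_000  # illustrative context size, not from the paper
    for ratio in (3, 20):     # the two ends of the range quoted in the abstract
        budget = vision_tokens_needed(context_length, ratio)
        print(f"{ratio}x compression: {context_length} text tokens -> {budget} vision tokens")
```

Under these assumptions, the same text costs far fewer tokens at the high end of the range, which is why the information density per vision token, and its effect on long-context understanding, is the question the benchmark targets.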

Code & Models

Datasets

MLLM-CL/VTCBench
dataset · 94 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques