Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression
Jianping Zhong, Guochang Li, Chen Zhi, Junxiao Han, Zhen Qin, Xinkui Zhao, Nan Wang, Shuiguang Deng, Jianwei Yin

TL;DR
This paper introduces LongCodeOCR, a visual compression method for code that enables Vision-Language Models to handle long contexts more effectively by preserving global structure, outperforming traditional textual compression in several benchmarks.
Contribution
We propose LongCodeOCR, a novel visual compression framework that maintains global code structure for VLMs, addressing limitations of existing textual filtering methods.
Findings
At comparable compression ratios (1.7), LongCodeOCR improves code summarization scores by 36.85 points over LongCodeZip.
At a 1M-token context, it achieves better accuracy than LongCodeZip while operating at 4x higher compression.
Visual compression drastically reduces processing latency from hours to minutes.
Abstract
Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios (1.7), LongCodeOCR improves…
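The abstract describes rendering code into a sequence of images and comparing methods by compression ratio. As a rough illustration only, the sketch below lays code out on fixed-size pages and estimates a ratio of text tokens to visual tokens. Everything in it is an assumption made for illustration and not the paper's pipeline: the function names, the page geometry, the patch-based count of visual tokens, and the definition of the ratio itself are all hypothetical.

```python
import math


def paginate(code: str, rows_per_page: int, cols_per_line: int) -> list[list[str]]:
    """Wrap long lines at cols_per_line, then group rows into pages.

    Hypothetical layout step: each page stands in for one rendered image.
    """
    wrapped: list[str] = []
    for line in code.splitlines():
        line = line.expandtabs(4)
        if not line:
            wrapped.append("")  # keep blank lines so the layout mirrors the source
            continue
        for start in range(0, len(line), cols_per_line):
            wrapped.append(line[start:start + cols_per_line])
    return [wrapped[i:i + rows_per_page] for i in range(0, len(wrapped), rows_per_page)]


def visual_tokens_per_page(width_px: int, height_px: int, patch_px: int) -> int:
    """Assumed cost model: one visual token per patch_px x patch_px patch."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def compression_ratio(text_tokens: int, num_pages: int, tokens_per_page: int) -> float:
    """Assumed definition: text tokens divided by total visual tokens."""
    return text_tokens / (num_pages * tokens_per_page)


# Usage: ten short lines laid out four rows per page need three pages.
pages = paginate("x = 1\n" * 10, rows_per_page=4, cols_per_line=80)
per_page = visual_tokens_per_page(width_px=1024, height_px=1024, patch_px=32)
ratio = compression_ratio(text_tokens=6144, num_pages=len(pages), tokens_per_page=per_page)
```

Under this toy cost model, denser pages (more rows, smaller fonts) raise the ratio, while the whole file stays visible to the model instead of being filtered, which is the trade-off the abstract contrasts with textual compression.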
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
