TL;DR
This paper studies whether multimodal LLMs (MLLMs) can read source code rendered as images rather than as text, cutting token cost while maintaining accuracy on code understanding tasks. It reports that models remain effective with up to 8x token reduction.
Contribution
The authors present what they describe as the first systematic study showing that MLLMs can understand compressed images of code, substantially reducing token cost.
Findings
MLLMs can understand code with up to 8x token reduction.
Visual cues like syntax highlighting improve performance under compression.
Clone detection tasks are highly resilient to visual compression.
Abstract
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on…
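The claim that lowering resolution shrinks token cost can be illustrated with a little arithmetic. The sketch below is not the paper's pipeline, which this summary does not describe. It assumes a hypothetical vision encoder that emits one token per square patch (the `patch_px=14` default is an illustrative choice), and the function names `visual_tokens` and `downscale` are made up for this example. Real MLLMs differ in how they tile, pool, or cap image tokens.

```python
import math


def visual_tokens(width_px, height_px, patch_px=14):
    """Count image tokens, ASSUMING one token per patch_px x patch_px patch.

    Partial patches at the right and bottom edges are counted as full patches.
    """
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def downscale(width_px, height_px, token_reduction):
    """Return a new (width, height) whose area is about 1/token_reduction of the input.

    Token count tracks area, so each side is scaled by 1/sqrt(token_reduction).
    """
    side_scale = 1 / math.sqrt(token_reduction)
    new_w = max(1, round(width_px * side_scale))
    new_h = max(1, round(height_px * side_scale))
    return new_w, new_h


if __name__ == "__main__":
    # Hypothetical rendered code image of 1120 x 1120 pixels.
    full = visual_tokens(1120, 1120)
    for target in (2, 4, 8):
        w, h = downscale(1120, 1120, target)
        small = visual_tokens(w, h)
        print(f"target {target}x: {w}x{h} px, {small} tokens, "
              f"actual reduction {full / small:.2f}x")
```

Because the token count follows image area, a target reduction of 8x only requires shrinking each side by a factor of about 2.83. The achieved reduction can fall slightly short of the target when the downscaled size is not a multiple of the patch size, since edge patches are rounded up.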
