CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi; Chaoxiang Xie; Zhensu Sun; Yeheng Chen; Chenxu Zhang; Longfei Yun; Chengcheng Wan; Hongyu Zhang; David Lo; Xiaodong Gu

arXiv:2602.01785 · cs.CL · April 29, 2026

TL;DR

This paper investigates whether multimodal LLMs (MLLMs) can understand source code rendered as images rather than as text, with the goal of cutting token cost while maintaining accuracy. It reports that code understanding holds up at up to 8x token reduction.

Contribution

The paper presents the first systematic study of whether MLLMs can understand code rendered as compressed images, and shows that they can do so at substantially lower token cost.

Findings

01. MLLMs can understand code with up to 8x token reduction.

02. Visual cues like syntax highlighting improve performance under compression.

03. Clone detection tasks are highly resilient to visual compression.

Abstract

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on…
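The resolution argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical ViT-style encoder that emits one visual token per 14x14 pixel patch (the patch size, the function names, and the example image size are all assumptions), and it shows how shrinking a rendered code image reduces the token count roughly in proportion to its area.

```python
import math


def visual_tokens(width: int, height: int, patch: int = 14) -> int:
    """Token count under an ASSUMED tiling: one token per patch x patch tile.

    Real MLLMs differ in patch size, tiling, and token merging, so treat
    this as a rough model rather than any specific model's accounting.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


def compressed_size(width: int, height: int, ratio: float) -> tuple[int, int]:
    """Scale both sides by 1/sqrt(ratio) so the area shrinks by about `ratio`."""
    scale = 1.0 / math.sqrt(ratio)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # Hypothetical rendered code image of 1120 x 1120 pixels.
    w, h = 1120, 1120
    base = visual_tokens(w, h)
    for ratio in (1, 2, 4, 8):
        cw, ch = compressed_size(w, h, ratio)
        tokens = visual_tokens(cw, ch)
        print(f"target {ratio}x: {cw}x{ch} px, {tokens} tokens, "
              f"actual reduction {base / tokens:.2f}x")
```

Because token count scales with area, an 8x token reduction under this model needs only about a 2.8x reduction in each linear dimension, which is why rendered text can remain legible at fairly aggressive compression ratios. The achieved reduction deviates slightly from the target whenever the resized dimensions are not multiples of the patch size.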

Code & Models

Repositories

yerbapage/CodeOCR (GitHub)
