Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Jonathan Steinberg, Oren Gal

TL;DR
This paper investigates where OCR information enters the language processing stream of vision-language models. It finds architecture-specific bottlenecks and a low-dimensional OCR signal, and shows that removing OCR information can sometimes improve performance.
Contribution
It identifies architecture-dependent OCR bottlenecks, demonstrates shared text-processing pathways via PCA transfer, and shows that OCR removal can improve counting in modular models.
Findings
OCR sensitivity in DeepStack models (Qwen) peaks at mid-depth (about 50%) for scene text
The first principal component (PC1) captures up to 72.9% of OCR variance
OCR removal improves counting by up to 6.9 percentage points in some models
Abstract
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures up to 72.9% of variance. Crucially, principal component analysis (PCA) directions learned…
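The low-dimensionality claim in the abstract can be illustrated with a minimal sketch. It assumes activations at one layer have been collected for original images and their text-inpainted counterparts, takes the per-sample difference, and measures how much variance the first principal component explains. The synthetic data, the function name `pc1_variance_ratio`, and the mean-centering step are assumptions made for illustration only; they are not taken from the paper and do not use any real model.

```python
import numpy as np

def pc1_variance_ratio(orig_acts, inpainted_acts):
    """Return (PC1 direction, fraction of variance explained by PC1).

    Both inputs are hypothetical arrays of shape (n_samples, hidden_dim)
    holding activations at a single layer.
    """
    diffs = orig_acts - inpainted_acts                     # per-sample OCR-related difference
    centered = diffs - diffs.mean(axis=0, keepdims=True)   # PCA operates on centered data
    # Right singular vectors of the centered matrix are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    return vt[0], var[0] / var.sum()

# Synthetic stand-in for real activations (assumption): one dominant
# direction carries the "text" signal, plus small isotropic noise.
rng = np.random.default_rng(0)
n, d = 200, 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
inpainted = rng.normal(size=(n, d))
strength = rng.normal(scale=5.0, size=(n, 1))
orig = inpainted + strength * direction + 0.1 * rng.normal(size=(n, d))

pc1, ratio = pc1_variance_ratio(orig, inpainted)
print(f"PC1 explains {ratio:.1%} of the difference variance")
```

With real models, the same computation would be repeated per layer and per dataset. A high PC1 ratio at a given layer would be consistent with the abstract's description of a low-dimensional OCR signal.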
