Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg; Oren Gal

arXiv:2602.22918·cs.CL·May 18, 2026

TL;DR

This paper investigates where OCR information enters vision-language models. It finds architecture-specific bottlenecks, a low-dimensional OCR signal, and cases where removing that signal improves counting performance.

Contribution

It identifies architecture-dependent OCR bottlenecks, demonstrates shared text-processing pathways via PCA transfer, and shows OCR removal can improve counting in modular models.

Findings

01. DeepStack models (Qwen) show peak OCR sensitivity at mid-depth (about 50%) for scene text.

02. The first principal component (PC1) captures up to 72.9% of OCR variance.

03. OCR removal improves counting by up to 6.9 percentage points in some models.

Abstract

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures up to 72.9% of variance. Crucially, principal component analysis (PCA) directions learned…
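The low-dimensionality claim can be illustrated with a minimal sketch, which is not the authors' code. It assumes hidden states are pooled to one vector per image and that differences are mean-centered before PCA; the abstract states neither detail. The arrays are synthetic, and `pc1_explained_variance`, `acts_original`, and `acts_inpainted` are hypothetical names introduced only for this example.

```python
import numpy as np


def pc1_explained_variance(acts_original, acts_inpainted):
    """Return (fraction of variance on PC1, PC1 direction) for activation differences.

    Both inputs have shape (n_samples, hidden_dim). Mean-centering is an
    assumption of this sketch, not a detail confirmed by the abstract.
    """
    diffs = acts_original - acts_inpainted
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional
    # to the variance captured by each principal component.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    return var[0] / var.sum(), vt[0]


# Synthetic stand-in for real model activations (hypothetical data).
rng = np.random.default_rng(0)
n, d = 200, 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

acts_inpainted = rng.normal(size=(n, d))
# Text presence shifts activations along a single direction, plus small noise.
strength = rng.normal(loc=5.0, scale=3.0, size=(n, 1))
acts_original = acts_inpainted + strength * direction + 0.1 * rng.normal(size=(n, d))

ratio, pc1 = pc1_explained_variance(acts_original, acts_inpainted)
alignment = abs(pc1 @ direction)  # |cosine| between PC1 and the planted direction
print(f"PC1 variance fraction: {ratio:.3f}, alignment: {alignment:.3f}")
```

Because the synthetic shift lies along one planted direction, PC1 recovers it and dominates the variance. On real activations the fraction would depend on the layer, dataset, and pooling choice.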
