Emergence of Text Readability in Vision Language Models

Jaeyoo Park; Sanghyuk Chun; Wonjae Kim; Sangdoo Yun; Bohyung Han

arXiv:2506.19389·cs.CV·June 25, 2025

Open Access

TL;DR

This paper studies how the ability to read text in images emerges abruptly in vision-language models after extensive training, whereas semantic understanding develops gradually from early on, and it highlights the need for tailored training strategies.

Contribution

The paper shows that text readability emerges abruptly during VLM training and discusses what this implies for improving multimodal learning strategies.

Findings

01. Text readability emerges abruptly after many training iterations.

02. Semantic understanding develops gradually from early training stages.

03. Matching images with rendered text develops even more slowly.

Abstract

We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding, which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even more slowly, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research…

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques