Emergence of Text Readability in Vision Language Models
Jaeyoo Park, Sanghyuk Chun, Wonjae Kim, Sangdoo Yun, Bohyung Han

TL;DR
This paper studies how the ability to read text in images emerges abruptly in vision-language models after extensive training, in contrast to semantic understanding, which develops gradually, and argues for tailored training strategies.
Contribution
It reveals the abrupt emergence of text readability in VLMs during training and discusses implications for improving multimodal learning strategies.
Findings
Text readability emerges abruptly after many training iterations.
Semantic understanding develops gradually from early training stages.
Matching images with rendered text develops even more slowly.
Abstract
We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding, which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even more slowly, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research…
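The abstract refers to contrastive learning without stating the exact objective. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (assumed objective, hypothetical helper).

    image_embs[i] and text_embs[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (i2t + t2i) / 2


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                               [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                               [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.6f}")
```

Under this assumed objective, the loss only rewards ranking the paired caption above the other captions in the batch, so any cue that separates pairs lowers it. This is consistent with the abstract's suggestion that general semantic cues could be picked up before text-specific ones, though the sketch itself does not demonstrate that ordering.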
Taxonomy
Topics: Text Readability and Simplification · Natural Language Processing Techniques
