Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models
Heng Zhou, Ao Yu, Li Kang, Yuchen Fan, Yutao Fan, Xiufeng Song, Hejia Geng, Yiran Qin

TL;DR
This paper investigates the typography recognition gap in vision-language models: they recognize text color almost perfectly but recognize font style poorly. It also shows that LoRA fine-tuning on a small set of synthetic samples narrows the gap.
Contribution
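It systematically evaluates font family, size, style, and color recognition in 15 VLMs across 26 fonts, four scripts, and three difficulty levels, attributes the observed weaknesses to a training-data omission rather than a capacity ceiling, and demonstrates that LoRA fine-tuning on synthetic samples improves typographic understanding.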
Findings
Color recognition is near-perfect across models.
Font style detection remains universally poor.
LoRA fine-tuning lets an open-source model surpass the best closed-source system on font size recognition.
Abstract
Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition.…
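The per-attribute comparison described in the abstract (font family, size, style, and color scored separately) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record layout, the `per_attribute_accuracy` helper, the case-insensitive exact-match scoring rule, and the toy records are all assumptions made for illustration, and the toy values are invented placeholders rather than results from the paper.

```python
from collections import defaultdict

# The four typographic attributes named in the abstract.
ATTRIBUTES = ("family", "size", "style", "color")


def per_attribute_accuracy(records):
    """Return {attribute: accuracy} for a list of (truth, prediction) dict pairs.

    Assumption: an answer counts as correct when it matches the ground truth
    after trimming whitespace and lowercasing. The paper may score differently.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in records:
        for attr in ATTRIBUTES:
            if attr not in truth:
                continue  # skip attributes that were not asked for this sample
            total[attr] += 1
            expected = str(truth[attr]).strip().lower()
            answered = str(pred.get(attr, "")).strip().lower()
            if answered == expected:
                correct[attr] += 1
    return {a: correct[a] / total[a] for a in ATTRIBUTES if total[a]}


# Hypothetical placeholder records: (ground truth, model answer).
toy_records = [
    (
        {"family": "Arial", "size": "large", "style": "bold", "color": "red"},
        {"family": "Arial", "size": "large", "style": "regular", "color": "red"},
    ),
    (
        {"family": "Times New Roman", "size": "small", "style": "italic", "color": "blue"},
        {"family": "Georgia", "size": "small", "style": "regular", "color": "Blue"},
    ),
]

scores = per_attribute_accuracy(toy_records)
for attribute, accuracy in scores.items():
    print(f"{attribute}: {accuracy:.2f}")
```

Scoring each attribute separately is what makes a hierarchy such as "color high, style low" visible; a single pooled accuracy would hide it.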
Taxonomy
Topics: Handwritten Text Recognition Techniques · Subtitles and Audiovisual Media · Multimodal Machine Learning Applications
