GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze

TL;DR
GlotOCR Bench is a new benchmark that evaluates OCR models on more than 100 Unicode scripts, revealing limited generalization and a reliance on language pretraining.
Contribution
This work introduces GlotOCR Bench, a large-scale multilingual OCR benchmark, and provides insights into the generalization gaps of current vision-language models.
Findings
Most models perform well on fewer than ten scripts.
Even top models fail beyond thirty scripts.
Performance correlates with script-level pretraining coverage.
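One way to make "performs well on N scripts" concrete is to compute an error rate per script and count the scripts that fall under a cutoff. The sketch below is only an illustration: the character error rate metric, the 0.1 cutoff, the function names, and the sample data are all assumptions made here, not the benchmark's actual scoring procedure.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate over Unicode code points (not grapheme clusters)."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return edit_distance(reference, prediction) / len(reference)


def scripts_handled(samples, max_cer=0.1):
    """Return the scripts whose mean CER is at or below max_cer.

    samples: iterable of (script, reference, prediction) triples.
    max_cer: hypothetical cutoff for "performs well"; an assumption.
    """
    per_script = defaultdict(list)
    for script, ref, pred in samples:
        per_script[script].append(cer(ref, pred))
    return sorted(s for s, v in per_script.items()
                  if sum(v) / len(v) <= max_cer)


# Made-up OCR outputs, labeled with ISO 15924 script codes.
samples = [
    ("Latn", "hello", "hello"),
    ("Latn", "world", "worle"),
    ("Grek", "αβγ", "abg"),
]
handled = scripts_handled(samples, max_cer=0.15)
```

With these made-up samples only the Latin-script outputs pass the cutoff, so the model would be counted as handling one script. The count is sensitive to the chosen cutoff, which is why any threshold should be reported alongside the number of scripts.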
Abstract
Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly…
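Rendering both LTR and RTL scripts means the text's direction has to be known before shaping. As a rough illustration, the sketch below guesses a string's base direction from its first strongly directional character, using only Python's standard `unicodedata` module. The `guess_direction` helper is a hypothetical name introduced here, and the heuristic is a simplification; nothing in the abstract says the benchmark determines direction this way.

```python
import unicodedata

# Bidirectional classes that mark strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def guess_direction(text: str) -> str:
    """Return "rtl" or "ltr" from the first strongly directional character.

    Digits, spaces, and punctuation are weak or neutral and are skipped.
    Falls back to "ltr" when no strong character is found.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"


latin_dir = guess_direction("hello")
hebrew_dir = guess_direction("שלום")
arabic_dir = guess_direction("123 مرحبا")  # leading digits are not strong
```

A first-strong heuristic like this is cheap but can misjudge mixed-direction lines, so a full pipeline would more likely rely on script metadata or a complete bidirectional algorithm implementation.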
