What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels
Jeonghun Baek, Yusuke Matsui, Kiyoharu Aizawa

TL;DR
This paper demonstrates that scene text recognition models can be effectively trained using only real labeled data, supplemented with data augmentation and semi/self-supervised learning, challenging the belief that synthetic data is essential.
Contribution
It is the first to show competitive STR performance with only real data and introduces semi- and self-supervised methods into this setting.
Findings
Models trained on real data alone achieve competitive accuracy.
Data augmentation and semi/self-supervised methods significantly improve performance.
The study challenges the necessity of synthetic data for STR training.
Abstract
Scene text recognition (STR) task has a common practice: All state-of-the-art STR models are trained on large synthetic data. In contrast to this practice, training STR models only on fewer real labels (STR with fewer labels) is important when we have to train STR models without synthetic data: for handwritten or artistic texts that are difficult to generate synthetically and for languages other than English for which we do not always have synthetic data. However, there has been implicit common knowledge that training STR models on real data is nearly impossible because real data is insufficient. We consider that this common knowledge has obstructed the study of STR with fewer labels. In this work, we would like to reactivate STR with fewer labels by disproving the common knowledge. We consolidate recently accumulated public real data and show that we can train STR models satisfactorily…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHandwritten Text Recognition Techniques · Natural Language Processing Techniques · Multimodal Machine Learning Applications
