MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu,, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

TL;DR
MaskOCR introduces a unified vision-language pretraining framework for text recognition that leverages masked image-language modeling, resulting in improved performance on Chinese and English text image benchmarks.
Contribution
The paper proposes MaskOCR, a novel encoder-decoder pretraining method that unifies visual and linguistic learning for text recognition using masked image-language modeling.
Findings
Achieves superior accuracy on benchmark datasets.
Effectively learns visual representations from unlabeled images.
Enhances language modeling without additional language models.
Abstract
Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHandwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques
