Masked Vision-Language Transformers for Scene Text Recognition
Jie Wu, Ying Peng, Shengming Zhang, Weigang Qi, Jian Zhang

TL;DR
This paper introduces Masked Vision-Language Transformers (MVLT), a novel model for scene text recognition that leverages both visual and linguistic information through a two-stage training process, achieving superior benchmark performance.
Contribution
The paper proposes MVLT, a model that pairs a Vision Transformer encoder with a multi-modal Transformer decoder and is trained with an STR-tailored masked pretraining stage followed by fine-tuning with iterative correction.
Findings
MVLT outperforms state-of-the-art models on multiple benchmarks.
The two-stage training enhances recognition accuracy.
The model effectively captures explicit and implicit linguistic cues.
Abstract
Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose novel Masked Vision-Language Transformers (MVLT) to capture both the explicit and the implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve the performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
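The two training stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform random choice of patches, the 0.75 mask ratio in the usage note, and the "stop when the prediction no longer changes" rule are all assumptions made for illustration, and the summary above does not specify any of them.

```python
import random
from typing import Callable, List, Optional


def random_patch_mask(num_patches: int, mask_ratio: float,
                      seed: Optional[int] = None) -> List[bool]:
    """Return one flag per image patch; True marks a patch to hide.

    Hypothetical helper: uniform random masking is an assumption, not
    necessarily the masking strategy used by MVLT.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def iterative_correct(recognize: Callable[[str], str], first_guess: str,
                      max_iters: int = 3) -> str:
    """Feed each prediction back as a textual hint until it stops changing.

    `recognize` is a hypothetical stand-in for a model call that takes the
    previous prediction and returns a refined one.
    """
    text = first_guess
    for _ in range(max_iters):
        refined = recognize(text)
        if refined == text:  # fixed point reached, further passes are no-ops
            break
        text = refined
    return text


def toy_recognizer(hint: str) -> str:
    """Toy stand-in for a model: repairs one wrong character per pass."""
    target = "hello"
    chars = list(hint)
    for i, (got, want) in enumerate(zip(chars, target)):
        if got != want:
            chars[i] = want
            break
    return "".join(chars)
```

For example, `random_patch_mask(16, 0.75, seed=0)` hides 12 of 16 patches, and `iterative_correct(toy_recognizer, "hxlxo", max_iters=5)` converges to `"hello"` after two refinement passes. In a real system the recognizer would be the fine-tuned model, and the number of correction passes would be a tunable setting.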
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Adam · Absolute Position Encodings · Layer Normalization
