VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer
Humen Zhong, Zhibo Yang, Zhaohai Li, Peng Wang, Jun Tang, Wenqing Cheng, Cong Yao

TL;DR
VL-Reader introduces a novel vision-language reconstruction approach for scene text recognition, leveraging masked autoencoding to improve accuracy by modeling visual and semantic information jointly.
Contribution
The paper proposes VL-Reader, a new scene text recognition model that employs masked visual-linguistic reconstruction and bi-modal feature interaction, maintaining consistency from pre-training to fine-tuning.
Findings
Achieves an average accuracy of 97.1% across six datasets, surpassing the previous SOTA by 1.1%.
Significantly improves recognition accuracy on challenging datasets.
Demonstrates the effectiveness of a vision-language reconstructor for scene text recognition.
Abstract
Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency in the framework during pre-training, if it exists, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of the VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously…
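The masked-autoencoding idea the abstract builds on can be illustrated with a minimal sketch that hides part of both the visual input (image patches) and the linguistic input (characters), keeping the hidden values as reconstruction targets. This is not the authors' implementation: the function names, the `[MASK]` placeholder, and the mask ratios (0.75 for patches, 0.2 for characters) are all assumptions chosen for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a hidden character


def random_mask(length, ratio, rng):
    """Pick which positions to hide: roughly `ratio` of `length`, at least one."""
    if length == 0:
        return []
    num_masked = min(length, max(1, round(length * ratio)))
    return sorted(rng.sample(range(length), num_masked))


def mask_pair(patches, chars, visual_ratio=0.75, text_ratio=0.2, seed=0):
    """Mask image patches and characters together (ratios are assumed values).

    Returns the masked patches, the masked characters, and the hidden
    originals that a reconstruction model would be trained to predict.
    """
    rng = random.Random(seed)
    hidden_patches = set(random_mask(len(patches), visual_ratio, rng))
    hidden_chars = set(random_mask(len(chars), text_ratio, rng))

    # Hidden patches become None; hidden characters become MASK_TOKEN.
    masked_patches = [None if i in hidden_patches else p
                      for i, p in enumerate(patches)]
    masked_chars = [MASK_TOKEN if i in hidden_chars else c
                    for i, c in enumerate(chars)]

    targets = {
        "patches": {i: patches[i] for i in hidden_patches},
        "chars": {i: chars[i] for i in hidden_chars},
    }
    return masked_patches, masked_chars, targets


# Usage: eight stand-in "patches" (integers) paired with the label "hello".
masked_patches, masked_chars, targets = mask_pair(list(range(8)), list("hello"))
```

In a real model the patches would be image tensors and the targets would feed a reconstruction loss for each modality; the sketch only shows how the two inputs can be masked side by side.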
Taxonomy
Topics: Handwritten Text Recognition Techniques
