VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Humen Zhong; Zhibo Yang; Zhaohai Li; Peng Wang; Jun Tang; Wenqing Cheng; Cong Yao

arXiv:2409.11656 · cs.CV · September 19, 2024

TL;DR

VL-Reader introduces a novel vision-language reconstruction approach for scene text recognition, leveraging masked autoencoding to improve accuracy by modeling visual and semantic information jointly.

Contribution

The paper proposes VL-Reader, a new scene text recognition model that employs masked visual-linguistic reconstruction and bi-modal feature interaction, maintaining consistency from pre-training to fine-tuning.
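As a rough illustration of what masking both modalities before reconstruction could look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the `[MASK]` placeholder, the mask ratios, and the helper names `mask_sequence` and `mask_both_modalities` are all assumptions made for this example, and the reconstruction model itself is left out.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_sequence(items, ratio, rng):
    """Hide about `ratio` of the items; return (masked copy, {index: original})."""
    if not items:
        return [], {}
    n_mask = max(1, round(len(items) * ratio))
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in masked_idx else x for i, x in enumerate(items)]
    # The hidden originals are what a reconstructor would be trained to predict.
    targets = {i: items[i] for i in sorted(masked_idx)}
    return masked, targets


def mask_both_modalities(patches, chars, patch_ratio=0.75, char_ratio=0.2, seed=0):
    """Mask image patches and characters in the same pass (ratios are assumed)."""
    rng = random.Random(seed)
    masked_patches, patch_targets = mask_sequence(patches, patch_ratio, rng)
    masked_chars, char_targets = mask_sequence(chars, char_ratio, rng)
    return masked_patches, patch_targets, masked_chars, char_targets


# Stand-ins for the visual patches of a word image and its text label.
patches = [f"patch_{i}" for i in range(8)]
chars = list("street")
masked_patches, patch_targets, masked_chars, char_targets = mask_both_modalities(
    patches, chars
)
print(masked_patches)
print(masked_chars)
```

In this sketch, a model would receive the two masked sequences together and be scored on how well it recovers `patch_targets` and `char_targets`, which is the sense in which the visual and linguistic sides are reconstructed jointly.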

Findings

01 Achieves an average accuracy of 97.1% across six benchmark datasets, surpassing the previous SOTA by 1.1%.

02 Significantly improves recognition on challenging datasets.

03 Demonstrates the effectiveness of a vision-language reconstructor for scene text recognition.

Abstract

Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency in the framework during pre-training, if it exists, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of the VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously…

Taxonomy

Topics: Handwritten Text Recognition Techniques