Masked Vision-Language Transformers for Scene Text Recognition

Jie Wu; Ying Peng; Shengming Zhang; Weigang Qi; Jian Zhang

arXiv:2211.04785 · cs.CV · November 10, 2022 · 5 citations

TL;DR

This paper introduces Masked Vision-Language Transformers (MVLT), a scene text recognition model that uses both visual and linguistic information. It is trained in two stages (masked pretraining, then fine-tuning with iterative correction) and reports results superior to state-of-the-art STR models on several benchmarks.

Contribution

The paper proposes MVLT, which pairs a Vision Transformer encoder with a multi-modal Transformer decoder. The model is first pretrained with an STR-tailored masking strategy, then fine-tuned with an iterative correction method.
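To make the masking-based pretraining idea concrete, here is a minimal sketch of how an image's patches and its text label might be masked before a model is asked to reconstruct them. It is an illustration only, not the authors' implementation: the function name `mask_inputs`, the `[MASK]` token, and both masking ratios are assumptions, since this page does not state the actual scheme.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_inputs(num_patches, chars, patch_ratio=0.75, char_ratio=0.2, seed=0):
    """Pick image patches and label characters to hide (illustrative only).

    patch_ratio and char_ratio are assumed values, not the paper's settings.
    Returns (masked_patch_indices, masked_chars, targets).
    """
    rng = random.Random(seed)

    # Choose which image patches are hidden from the encoder.
    n_patch = int(num_patches * patch_ratio)
    masked_patches = sorted(rng.sample(range(num_patches), n_patch))

    # Choose which characters of the label are replaced by the mask token.
    n_char = max(1, int(len(chars) * char_ratio)) if chars else 0
    masked_pos = set(rng.sample(range(len(chars)), n_char))
    masked_chars = [MASK if i in masked_pos else c for i, c in enumerate(chars)]

    # The reconstruction targets are the original characters at masked positions.
    targets = {i: chars[i] for i in masked_pos}
    return masked_patches, masked_chars, targets


patches, masked, targets = mask_inputs(16, list("hello"))
print(len(patches), masked, targets)
```

A model pretrained this way would have to recover the hidden characters from the visible patches and the remaining text, which is one way visual and linguistic cues can be learned jointly.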

Findings

01. MVLT outperforms state-of-the-art models on multiple benchmarks.

02. The two-stage training enhances recognition accuracy.

03. The model effectively captures explicit and implicit linguistic cues.

Abstract

Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose Masked Vision-Language Transformers (MVLT), a novel model that captures both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve the performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
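The iterative correction step mentioned in the abstract can be pictured as a loop that feeds the previous prediction back to the decoder together with the image. The sketch below shows only that control flow under stated assumptions: `iterative_correction`, `toy_predict`, and `max_iters` are hypothetical names, and the toy predictor stands in for a real decoder. The official repository should be consulted for the actual procedure.

```python
def iterative_correction(predict, image, max_iters=3):
    """Refine a text prediction by feeding it back to the predictor.

    `predict(image, previous_text)` is a hypothetical callable; passing
    previous_text=None requests the initial, vision-only prediction.
    """
    text = predict(image, None)
    for _ in range(max_iters):
        refined = predict(image, text)
        if refined == text:  # stop early once the prediction is stable
            break
        text = refined
    return text


def toy_predict(image, previous):
    """Toy stand-in for a decoder: the 'image' is simply its true string."""
    target = image
    if previous is None:
        return "he1l0"  # deliberately noisy first guess
    # Correct the first character that disagrees with the target.
    for i, (p, t) in enumerate(zip(previous, target)):
        if p != t:
            return previous[:i] + t + previous[i + 1:]
    return previous


print(iterative_correction(toy_predict, "hello", max_iters=3))
```

Capping the loop with `max_iters` keeps inference cost bounded, while the early exit avoids redundant passes once the output stops changing. Both are design choices of this sketch, not details confirmed by the source.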

Code & Models

Repositories

onealwj/mvlt · PyTorch · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Adam · Absolute Position Encodings · Layer Normalization