SVTR: Scene Text Recognition with a Single Visual Model

Yongkun Du; Zhineng Chen; Caiyan Jia; Xiaoting Yin; Tianlun Zheng; Chenxia Li; Yuning Du; Yu-Gang Jiang

arXiv:2205.00159 · cs.CV · May 24, 2022 · 25 cites

TL;DR

SVTR introduces a simplified scene text recognition model that eliminates the need for sequence modeling by using a hierarchical patch-wise image tokenization approach, achieving high accuracy and efficiency on English and Chinese datasets.

Contribution

The paper presents a novel single visual model, SVTR, that replaces traditional hybrid architectures with a patch-wise tokenization framework for scene text recognition.
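
As a rough illustration of what patch-wise tokenization means, the sketch below cuts a toy grayscale image into a grid of small patches and flattens each one into a token. It is a minimal sketch under stated assumptions, not the paper's embedding layer: the function name `patchify`, the non-overlapping 2x2 patches, and the toy image are all hypothetical choices made for illustration.

```python
def patchify(image, patch_h, patch_w):
    """Split a 2-D grayscale image (list of rows) into non-overlapping patches.

    Returns (tokens, grid_h, grid_w). Each token is the flattened pixel
    values of one patch; tokens are listed in row-major order over the grid.
    Non-overlapping patches are an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    grid_h, grid_w = height // patch_h, width // patch_w
    tokens = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            # Collect the pixels of patch (gy, gx), row by row.
            patch = [
                image[gy * patch_h + dy][gx * patch_w + dx]
                for dy in range(patch_h)
                for dx in range(patch_w)
            ]
            tokens.append(patch)
    return tokens, grid_h, grid_w


# A toy 4x8 "text image" whose pixel value encodes its position.
image = [[row * 8 + col for col in range(8)] for row in range(4)]
tokens, grid_h, grid_w = patchify(image, patch_h=2, patch_w=2)
print(grid_h, grid_w, len(tokens))
```

Keeping the grid shape alongside the flat token list matters because later stages need to know which tokens are spatial neighbours.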

Findings

01. SVTR-L achieves competitive accuracy on English text recognition.

02. SVTR outperforms existing methods in Chinese text recognition.

03. SVTR-T offers a smaller, faster model with good accuracy.

Abstract

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English…
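
The difference between global and local mixing in the abstract can be pictured as a difference in which tokens are allowed to exchange information. The sketch below is one plausible reading, not the authors' implementation: it assumes mixing is a single-head dot-product self-attention without learned projections, and that local mixing restricts each token to a small window around it on the patch grid. The names `mixing_mask` and `masked_attention`, the 3x3 window, and the toy tokens are hypothetical.

```python
import math


def mixing_mask(grid_h, grid_w, window=None):
    """Return an N x N boolean mask, N = grid_h * grid_w.

    mask[i][j] is True when token i may attend to token j.
    window=None gives global mixing (every token sees every token).
    window=(wh, ww) gives local mixing: each token sees only a wh x ww
    neighbourhood centred on itself. Window sizes are assumed odd.
    """
    n = grid_h * grid_w
    if window is None:
        return [[True] * n for _ in range(n)]
    half_h, half_w = window[0] // 2, window[1] // 2
    mask = []
    for i in range(n):
        iy, ix = divmod(i, grid_w)
        row = []
        for j in range(n):
            jy, jx = divmod(j, grid_w)
            row.append(abs(iy - jy) <= half_h and abs(ix - jx) <= half_w)
        mask.append(row)
    return mask


def masked_attention(tokens, mask):
    """Dot-product attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    mixed = []
    for i, query in enumerate(tokens):
        # Masked-out positions get -inf so their softmax weight is zero.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, key in enumerate(tokens)
        ]
        peak = max(scores)  # finite: a token always sees itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return mixed


# Toy 2x4 grid of 2-dimensional tokens (hypothetical values).
tokens = [[float(i), 1.0] for i in range(8)]
global_mask = mixing_mask(2, 4)
local_mask = mixing_mask(2, 4, window=(3, 3))
global_out = masked_attention(tokens, global_mask)
local_out = masked_attention(tokens, local_mask)
print(sum(global_mask[0]), sum(local_mask[0]))
```

With the mask as the only difference, the same attention routine serves both block types: the global mask lets a token gather context from other characters, while the local mask confines it to nearby components. The merging stages and the final linear prediction described in the abstract are not shown here.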

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Text and Document Classification Technologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings