SVTR: Scene Text Recognition with a Single Visual Model
Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, Yu-Gang Jiang

TL;DR
SVTR introduces a simplified scene text recognition model that eliminates the need for sequence modeling by using a hierarchical patch-wise image tokenization approach, achieving high accuracy and efficiency on English and Chinese datasets.
Contribution
The paper presents a novel single visual model, SVTR, that replaces traditional hybrid architectures with a patch-wise tokenization framework for scene text recognition.
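As a rough illustration of patch-wise tokenization, the sketch below splits a toy image into a grid of small patches and flattens each one into a token. The function name `tokenize_patches`, the non-overlapping split, and the patch size are assumptions made for illustration; this summary does not specify how SVTR's own patch embedding is implemented.

```python
def tokenize_patches(image, patch_h, patch_w):
    """Split a 2-D image (list of rows) into non-overlapping patches.

    Hypothetical helper: each patch is flattened into one token, and
    tokens are returned in row-major order over the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patch = [image[r][c]
                     for r in range(top, top + patch_h)
                     for c in range(left, left + patch_w)]
            tokens.append(patch)
    return tokens


# Toy 4x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(4)]
tokens = tokenize_patches(image, 2, 2)  # 2x4 grid of patches, 4 pixels each
```

Each token here plays the role of one "character component"; a real model would project the tokens to a learned embedding before any mixing takes place.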
Findings
SVTR-L, the large variant, achieves accuracy competitive with existing methods on English scene text recognition.
SVTR outperforms existing methods on Chinese scene text recognition.
SVTR-T, the tiny variant, is a much smaller and faster model that retains good accuracy.
Abstract
Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English…
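The difference between global and local mixing can be sketched as self-attention over the grid of character components, with local mixing restricted to a window around each component. This is a minimal sketch under stated assumptions: the helper names `local_mixing_mask` and `mix`, the odd window sizes, and the omission of learned projections and multiple heads are all illustrative choices, not the paper's implementation.

```python
import math


def local_mixing_mask(h, w, win_h, win_w):
    """mask[i][j] is True when component j lies inside the win_h x win_w
    window centred on component i (components indexed row-major on an
    h x w grid; window sizes are assumed to be odd)."""
    n = h * w
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        ri, ci = divmod(i, w)
        for j in range(n):
            rj, cj = divmod(j, w)
            mask[i][j] = (abs(ri - rj) <= win_h // 2
                          and abs(ci - cj) <= win_w // 2)
    return mask


def mix(tokens, mask=None):
    """Single-head dot-product self-attention without learned projections.

    mask=None attends over all components (global mixing); otherwise
    attention is limited to the positions the mask allows (local mixing).
    """
    d = len(tokens[0])
    out = []
    for i, q in enumerate(tokens):
        idx = [j for j in range(len(tokens)) if mask is None or mask[i][j]]
        scores = [sum(a * b for a, b in zip(q, tokens[j])) / math.sqrt(d)
                  for j in idx]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * tokens[j][c] for e, j in zip(exps, idx))
                    for c in range(d)])
    return out


# Toy 2x3 grid of components with 2-dimensional features.
tokens = [[float(i), 1.0] for i in range(6)]
global_out = mix(tokens)                                   # inter-character view
local_out = mix(tokens, local_mixing_mask(2, 3, 3, 3))     # intra-character view
```

In this sketch the two blocks differ only in which components each position may attend to, which is one simple way to read the abstract's distinction between inter-character and intra-character patterns.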
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Text and Document Classification Technologies
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
