Scene Text Recognition via Transformer
Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, and Shengping Zhang

TL;DR
This paper introduces a transformer-based scene text recognition method that eliminates the need for image rectification, achieving state-of-the-art accuracy on challenging datasets by leveraging spatial attention.
Contribution
Unlike earlier transformer-based models that use only the transformer decoder, the proposed method feeds convolutional feature maps into the transformer as word embeddings. This removes the rectification step and improves recognition accuracy.
Findings
Achieves 99.3% accuracy on CUTE dataset
Outperforms existing methods by a large margin
Eliminates the need for image rectification in scene text recognition
Abstract
Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into a normalized image, then treat recognition as a sequence prediction task. The bottleneck of such methods is the rectification, which causes errors due to perspective distortion. In this paper, we find that the rectification is completely unnecessary. All we need is spatial attention. We therefore propose a simple but extremely effective scene text recognition method based on the transformer [50]. Different from previous transformer-based models [56,34], which just use the decoder of the transformer to decode the convolutional attention, the proposed method uses convolutional feature maps as the word-embedding input to the transformer. In such a way, our method is able to make full…
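The idea of treating a convolutional feature map as a sequence of word embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the row-major flattening order is an assumption, and the 1D sinusoidal encoding is simply the standard absolute position encoding, since the truncated abstract does not say how positions are encoded. Each spatial cell of an H x W x C feature map becomes one token of dimension C.

```python
import math


def flatten_feature_map(fmap):
    """Turn an H x W x C feature map (nested lists) into H*W token vectors of size C.

    Row-major order is an assumption made for this sketch.
    """
    return [list(cell) for row in fmap for cell in row]


def sinusoidal_encoding(num_positions, dim):
    """Standard absolute sinusoidal position encodings, num_positions x dim."""
    enc = []
    for pos in range(num_positions):
        vec = []
        for i in range(dim):
            # Paired sin/cos channels share one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(vec)
    return enc


def feature_map_to_transformer_input(fmap):
    """Flatten the feature map and add position encodings to each token."""
    tokens = flatten_feature_map(fmap)
    dim = len(tokens[0])
    positions = sinusoidal_encoding(len(tokens), dim)
    return [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, positions)]


if __name__ == "__main__":
    # Toy 2 x 3 feature map with 4 channels, all zeros.
    toy = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
    seq = feature_map_to_transformer_input(toy)
    print(len(seq), len(seq[0]))
```

In this sketch no rectified image is ever produced. The resulting token sequence is what a transformer's attention layers would operate on, so every output character can attend to any spatial location of the feature map.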
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
