Scene Text Recognition via Transformer

Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, and Shengping Zhang

arXiv:2003.08077 · cs.CV · April 30, 2020 · 6 cites

TL;DR

This paper introduces a transformer-based scene text recognition method that eliminates the need for image rectification, achieving state-of-the-art accuracy on challenging datasets by leveraging spatial attention.

Contribution

The proposed method uniquely uses convolutional feature maps as transformer input, avoiding rectification and significantly improving recognition accuracy.

Findings

01. Achieves 99.3% accuracy on the CUTE dataset

02. Outperforms existing methods by a large margin

03. Eliminates the need for image rectification in scene text recognition

Abstract

Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into a normalized image, then treat recognition as a sequence prediction task. The bottleneck of such methods is the rectification, which introduces errors due to perspective distortion. In this paper, we find that rectification is completely unnecessary; all we need is spatial attention. We therefore propose a simple but extremely effective scene text recognition method based on the transformer [50]. Unlike previous transformer-based models [56,34], which use only the decoder of the transformer to decode the convolutional attention, the proposed method uses convolutional feature maps as the word embedding input to the transformer. In this way, our method is able to make full…
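The central idea in the abstract, treating each spatial cell of a convolutional feature map as one "word embedding" and attending over all cells, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `flatten_feature_map` and `attention_weights`, the toy feature map, and the absence of learned projections and positional encodings are all assumptions made for illustration.

```python
import math
from typing import List

# Hypothetical layout: feature_map[y][x] is a C-dimensional vector.
FeatureMap = List[List[List[float]]]


def flatten_feature_map(feature_map: FeatureMap) -> List[List[float]]:
    """Flatten an H x W x C map into H*W tokens of size C (row-major order)."""
    tokens = []
    for row in feature_map:
        for cell in row:
            tokens.append(list(cell))
    return tokens


def attention_weights(query: List[float], tokens: List[List[float]]) -> List[float]:
    """Scaled dot-product attention of one query over all spatial tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2 x 3 feature map with 4 channels; cell (y, x) holds the value y*W + x.
H, W, C = 2, 3, 4
feature_map = [[[float(y * W + x)] * C for x in range(W)] for y in range(H)]

tokens = flatten_feature_map(feature_map)            # 6 tokens, each of size 4
weights = attention_weights([1.0] * C, tokens)       # one weight per spatial cell
best_cell = max(range(len(weights)), key=weights.__getitem__)
```

Because every cell of the feature map stays addressable as its own token, the attention weights can select any 2D location directly, which is one way to read the abstract's claim that spatial attention makes rectification unnecessary.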

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax