NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition
Fenfen Sheng, Zhineng Chen, Bo Xu

TL;DR
NRTR is a no-recurrence, self-attention based sequence-to-sequence model for scene text recognition. By eliminating recurrence and convolution, it reaches state-of-the-art or competitive accuracy while training significantly faster.
Contribution
The paper presents the first no-recurrence, self-attention based scene text recognizer, reducing complexity and training time while maintaining high accuracy.
Findings
Achieves state-of-the-art or competitive performance on benchmarks.
Trains at least 8 times faster than previous models.
Effectively handles regular and irregular scene text.
Abstract
Scene text recognition has attracted a great deal of research due to its importance to various applications. Existing methods mainly adopt recurrence or convolution based networks. Although they have obtained good performance, these methods still suffer from two limitations: slow training speed due to the internal recurrence of RNNs, and high complexity due to stacked convolutional layers for long-term feature extraction. This paper, for the first time, proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that dispenses with recurrences and convolutions entirely. NRTR follows the encoder-decoder paradigm, where the encoder uses stacked self-attention to extract image features, and the decoder applies stacked self-attention to recognize texts based on the encoder output. NRTR relies solely on the self-attention mechanism and thus can be trained with more parallelization and less…
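Illustration
The stacked self-attention idea in the abstract can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: it assumes identity projections (queries, keys, and values are all the input itself) and omits learned weights, multiple heads, positional encoding, residual connections, and layer normalization. The names softmax, self_attention, and stacked_self_attention are hypothetical and chosen only for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with identity projections.

    Assumption for illustration: Q = K = V = seq, with no learned weights.
    Each output row depends only on this layer's input, never on another
    output row, so the rows could be computed in parallel. An RNN step,
    by contrast, must wait for the previous hidden state.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this position to every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted average of all positions' feature vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def stacked_self_attention(seq, num_layers):
    """Apply the same attention layer num_layers times (a toy 'stack')."""
    for _ in range(num_layers):
        seq = self_attention(seq)
    return seq


# Toy input: three positions, each with a 2-dimensional feature vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = stacked_self_attention(features, num_layers=2)
```

The output keeps the input's shape (one vector per position), which is what lets such layers be stacked. In a real model each layer would have its own learned projections.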
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Convolution
