Pure Transformer with Integrated Experts for Scene Text Recognition
Yew Lee Tan, Adams Wai-kin Kong, Jung-Jae Kim

TL;DR
This paper introduces PTIE, a pure transformer model with integrated experts for scene text recognition. It handles multiple patch resolutions and decodes text in both directions, achieving state-of-the-art results on 7 benchmarks.
Contribution
The paper proposes a novel pure transformer model with integrated experts that handles multiple patch resolutions and performs bidirectional decoding, improving scene text recognition performance.
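To make the two ideas named above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The 32×128 image size, the 4×8 and 8×4 patch shapes, and the helper names `patchify` and `decoding_targets` are all assumptions chosen for illustration. The sketch only shows how one image can be cut into patches at two resolutions, and how one label yields a left-to-right and a right-to-left target sequence.

```python
def patchify(image, patch_h, patch_w):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches.

    Hypothetical helper for illustration; patches are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Flatten one patch_h x patch_w block into a single vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_h)
                     for c in range(left, left + patch_w)]
            patches.append(patch)
    return patches


def decoding_targets(text):
    """Return the character sequence in both reading directions."""
    return {
        "left_to_right": list(text),
        "right_to_left": list(reversed(text)),
    }


# Assumed 32x128 grayscale image filled with dummy pixel values.
image = [[(r * 128 + c) % 256 for c in range(128)] for r in range(32)]

# Two assumed patch resolutions: wide (4 high, 8 wide) and tall (8 high, 4 wide).
wide_patches = patchify(image, 4, 8)
tall_patches = patchify(image, 8, 4)

targets = decoding_targets("STR")
```

With these assumed sizes, both resolutions produce the same number of patches and the same number of pixels per patch, so a single sequence model could in principle consume either sequence. Whether the paper relies on this property is not stated in the summary above.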
Findings
Outperforms hybrid CNN-transformer models in STR.
Achieves state-of-the-art results on 7 benchmarks.
Effectively processes images with varying aspect ratios.
Abstract
Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR because it shows a strong capability for capturing long-term dependency, which appears to be prominent in scene text images. Many researchers have used the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods only make use of the long-term dependency midway through the encoding process. Although the vision transformer (ViT) is able to capture such dependency at an early stage, its use remains largely unexplored in STR. This work proposes the use of a transformer-only model as a simple baseline which outperforms…
Taxonomy
Methods: Attention Is All You Need · Label Smoothing · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Multi-Head Attention · Adam · Absolute Position Encodings · Layer Normalization
