Pure Transformer with Integrated Experts for Scene Text Recognition

Yew Lee Tan; Adams Wai-kin Kong; Jung-Jae Kim

arXiv:2211.04963·cs.CV·November 10, 2022

TL;DR

This paper introduces PTIE, a pure transformer model with integrated experts for scene text recognition. The model processes multiple patch resolutions, decodes bidirectionally, and achieves state-of-the-art results on 7 benchmarks.

Contribution

The paper proposes a pure transformer model with integrated experts that handles multiple patch resolutions and decodes bidirectionally, improving scene text recognition performance.
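To make "multiple patch resolutions" and "bidirectional decoding" concrete, here is a minimal sketch in plain Python. It is an illustration only, not the authors' implementation: the 32x128 image size, the two patch sizes, and the helper names `patchify` and `decoding_targets` are all assumptions introduced for this example. It cuts the same image into patches of two different shapes and prepares a label in both reading orders.

```python
from typing import Dict, List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def patchify(image: Image, patch_h: int, patch_w: int) -> List[List[float]]:
    """Split an image into non-overlapping patches, each flattened to a vector."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patch = [
                image[row][col]
                for row in range(top, top + patch_h)
                for col in range(left, left + patch_w)
            ]
            patches.append(patch)
    return patches


def decoding_targets(text: str) -> Dict[str, str]:
    """Return the label in both reading orders (hypothetical helper)."""
    return {"left_to_right": text, "right_to_left": text[::-1]}


# Assumed values for illustration: a 32x128 crop and two patch shapes.
image = [[0.0] * 128 for _ in range(32)]
patch_resolutions = [(4, 8), (8, 4)]
token_counts = {res: len(patchify(image, *res)) for res in patch_resolutions}
targets = decoding_targets("hello")
```

With these assumed sizes, both patch shapes give the same number of tokens and the same vector length per token, so a single transformer could in principle consume either sequence. Whether the paper uses these exact sizes is not stated in this summary.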

Findings

01. Outperforms hybrid CNN-transformer models in STR.

02. Achieves state-of-the-art results on 7 benchmarks.

03. Effectively processes images with varying aspect ratios.

Abstract

Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR because it is strong at capturing long-term dependency, which appears to be prominent in scene text images. Many researchers have used the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods only make use of the long-term dependency midway through the encoding process. Although the vision transformer (ViT) can capture such dependency at an early stage, it remains largely unexploited in STR. This work proposes the use of a transformer-only model as a simple baseline which outperforms…

Taxonomy

Methods: Attention Is All You Need · Label Smoothing · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Multi-Head Attention · Adam · Absolute Position Encodings · Layer Normalization