TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

Yue Tao; Zhiwei Jia; Runze Ma; Shugong Xu

arXiv:2111.08314 · cs.CV · November 17, 2021


TL;DR

TRIG introduces a transformer-based architecture with learnable initial embeddings for scene text recognition, achieving state-of-the-art results by efficiently capturing global context without additional modules.

Contribution

The paper proposes a novel transformer-based text recognizer with adaptive initial embeddings, replacing CNNs and improving accuracy and efficiency in scene text recognition.

Findings

01. Achieves state-of-the-art performance on benchmarks.

02. Reduces complexity with 1-D split transformer encoder.

03. Improves decoding accuracy with learnable initial embeddings.
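The complexity reduction claimed for the 1-D split can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the split cuts the image into full-height vertical strips, and the 32×100 input size, the strip width of 4, and the helper names `split_1d`, `split_2d`, and `attention_pairs` are all hypothetical choices made for illustration. The point is only that fewer tokens means fewer pairwise interactions in self-attention, whose cost grows with the square of the sequence length.

```python
import numpy as np


def split_1d(image, strip_width):
    """Cut an (H, W, C) image into full-height vertical strips, one token each.

    Assumption: this is one plausible reading of a "1-D split"; the strip
    width is a hypothetical parameter, not a value taken from the paper.
    """
    h, w, c = image.shape
    if w % strip_width != 0:
        raise ValueError("image width must be divisible by strip_width")
    n = w // strip_width
    # (H, W, C) -> (H, n, strip_width, C) -> (n, H, strip_width, C)
    strips = image.reshape(h, n, strip_width, c).transpose(1, 0, 2, 3)
    # Flatten each strip into a single token vector.
    return strips.reshape(n, h * strip_width * c)


def split_2d(image, patch):
    """Cut an (H, W, C) image into square patches, for comparison only."""
    h, w, c = image.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    # (H, W, C) -> (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch * patch * c)


def attention_pairs(num_tokens):
    """Number of query-key pairs one self-attention layer scores."""
    return num_tokens ** 2


# Hypothetical input size: 32 pixels high, 100 wide, 3 channels.
image = np.zeros((32, 100, 3), dtype=np.float32)
tokens_1d = split_1d(image, strip_width=4)
tokens_2d = split_2d(image, patch=4)

print(tokens_1d.shape, attention_pairs(len(tokens_1d)))
print(tokens_2d.shape, attention_pairs(len(tokens_2d)))
```

Under these assumed sizes the strip split yields 25 tokens against 200 for square 4×4 patches, so each attention layer scores 625 pairs instead of 40,000. Each strip token is correspondingly wider (384 values against 48), which a linear projection would typically map to the model dimension.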

Abstract

Scene text recognition (STR) is an important bridge between images and text and has attracted abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most existing works need an extra context modeling module to help the CNN capture global dependencies, compensating for its inductive bias and strengthening the relationship between text features. Recently, the transformer has been proposed as a promising network for global context modeling through its self-attention mechanism, but one of its main shortcomings when applied to recognition is efficiency. We propose a 1-D split to address the challenge of complexity and replace the CNN with a transformer encoder to reduce the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to…
