TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance
Yue Tao, Zhiwei Jia, Runze Ma, Shugong Xu

TL;DR
TRIG introduces a transformer-based architecture with learnable initial embeddings for scene text recognition, achieving state-of-the-art results by capturing global context efficiently without a separate context modeling module.
Contribution
The paper proposes a transformer-based text recognizer that replaces the CNN with a transformer encoder and uses learnable initial embeddings to guide decoding, improving accuracy and efficiency in scene text recognition.
Findings
Achieves state-of-the-art performance on scene text recognition benchmarks.
Reduces complexity by applying a 1-D split in the transformer encoder.
Improves decoding accuracy with learnable initial embeddings.
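The efficiency claim can be illustrated with a small token-counting sketch. Self-attention cost grows roughly with the square of the number of tokens, so cutting the image into fewer, taller patches shrinks the attention matrix. The code below assumes the 1-D split means full-height vertical strips; the text here does not spell out the details. The image size (32 x 100), the patch width (4), and the helper names `split_1d` and `num_tokens_2d` are hypothetical choices for illustration, not values taken from the paper.

```python
def split_1d(image, patch_width):
    """Split an H x W image (list of rows) into full-height vertical strips.

    Each strip is flattened row by row into one token of length H * patch_width.
    Assumption: this is one plausible reading of a "1-D split".
    """
    height, width = len(image), len(image[0])
    if width % patch_width != 0:
        raise ValueError("width must be divisible by patch_width")
    tokens = []
    for left in range(0, width, patch_width):
        strip = [row[left:left + patch_width] for row in image]
        tokens.append([pixel for row in strip for pixel in row])
    return tokens


def num_tokens_2d(height, width, patch_size):
    """Token count for a square-patch (ViT-style) 2-D split, for comparison."""
    return (height // patch_size) * (width // patch_size)


# Hypothetical 32 x 100 grayscale image filled with dummy pixel values.
H, W, P = 32, 100, 4
image = [[(r * W + c) % 256 for c in range(W)] for r in range(H)]

tokens_1d = split_1d(image, P)   # one token per vertical strip
n_1d = len(tokens_1d)            # tokens along the width only
n_2d = num_tokens_2d(H, W, P)    # tokens along both height and width

# Ratio of pairwise attention entries (N^2) between the two splits.
attention_ratio = (n_2d ** 2) / (n_1d ** 2)
print(n_1d, n_2d, attention_ratio)
```

Under these assumed sizes the 1-D split produces a sequence eight times shorter than the square-patch split, so the attention matrix has 64 times fewer entries. Each token is correspondingly longer, so the per-token projection cost goes up; the saving is in the quadratic attention term.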
Abstract
Scene text recognition (STR) is an important bridge between images and text and has attracted abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most existing works need an extra module (a context modeling module) to help the CNN capture global dependencies, compensating for its inductive bias and strengthening the relationships between text features. Recently, the transformer has been proposed as a promising network for global context modeling through its self-attention mechanism, but one of its main shortcomings when applied to recognition is efficiency. We propose a 1-D split to address the challenge of complexity and replace the CNN with a transformer encoder to reduce the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to…
