ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer
Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin

TL;DR
ESTextSpotter introduces an explicit synergy mechanism within a Transformer framework for scene text spotting, modeling task-specific features for detection and recognition to significantly enhance performance.
Contribution
The paper proposes a novel explicit synergy approach with task-aware queries and a vision-language communication module, improving over implicit methods in scene text spotting.
Findings
Outperforms previous state-of-the-art methods
Significantly improves text detection and recognition accuracy
Demonstrates the effectiveness of explicit synergy in Transformer models
Abstract
In recent years, end-to-end scene text spotting approaches have been evolving toward Transformer-based frameworks. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent Transformer-based methods usually adopt an implicit synergy strategy with a shared query, which cannot fully realize the potential of these two interactive tasks. In this paper, we argue that explicit synergy that considers the distinct characteristics of text detection and recognition can significantly improve the performance of text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional…
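The contrast between a shared query and task-specific queries can be illustrated with a toy sketch. This is not the paper's architecture: the abstract above is truncated before it describes the actual decomposition, so the function `explicit_synergy_step`, the tensor sizes, and the choice of plain dot-product attention for the exchange between tasks are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values

def explicit_synergy_step(det_q, rec_q, image_feats):
    """Hypothetical decoder step with separate detection and recognition queries.

    An implicit scheme would use one shared query for both tasks. Here each
    task keeps its own query, and the two exchange information explicitly.
    """
    # Each task gathers its own features from the image (residual update).
    det_q = det_q + attend(det_q, image_feats, image_feats)
    rec_q = rec_q + attend(rec_q, image_feats, image_feats)
    # Interaction: each task reads from the other task's queries.
    det_out = det_q + attend(det_q, rec_q, rec_q)
    rec_out = rec_q + attend(rec_q, det_q, det_q)
    return det_out, rec_out

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
num_instances, dim, num_pixels = 4, 16, 50
det_q = rng.normal(size=(num_instances, dim))
rec_q = rng.normal(size=(num_instances, dim))
image_feats = rng.normal(size=(num_pixels, dim))

det_out, rec_out = explicit_synergy_step(det_q, rec_q, image_feats)
print(det_out.shape, rec_out.shape)
```

The point of the sketch is only that the two tasks hold distinct representations and still communicate within one decoding step. How ESTextSpotter actually builds and couples its queries should be taken from the full paper.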
Videos
ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer (YouTube)
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections
