Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition
Xueming Yan, Zhihang Fang, Yaochu Jin

TL;DR
This paper introduces TANGER, an augmented transformer model with adaptive n-grams embedding and cross-language rectification, which the authors report improves multilingual scene text recognition on several benchmark datasets.
Contribution
The paper proposes a novel transformer architecture with adaptive n-grams embedding and cross-language rectification for better multilingual scene text recognition.
Findings
TANGER outperforms state-of-the-art methods on multiple benchmark datasets.
The model effectively handles complex multilingual scene texts.
Experimental results validate the robustness of TANGER across diverse languages.
Abstract
While vision transformers have been highly successful in improving the performance in image-based tasks, not much work has been reported on applying transformers to multilingual scene text recognition due to the complexities in the visual appearance of multilingual texts. To fill the gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings that aims to flexibly explore the potential correlations between neighbouring visual patches, which is essential for feature extraction from multilingual scene texts. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive…
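The abstract describes a supplementary transformer fed with adaptive n-grams embeddings of neighbouring visual patches, but it does not say how those embeddings are computed. The sketch below is one possible reading and not the authors' implementation. It assumes that an n-gram embedding is the mean of n consecutive patch tokens, and that "adaptive" means a softmax-weighted mix over several n-gram orders. The names `patch_embed`, `ngram_embed`, `adaptive_ngram_embed`, and `gate_logits` are hypothetical and chosen for illustration only.

```python
import numpy as np


def patch_embed(image, patch, proj):
    """Cut an (H, W) image into non-overlapping patch x patch tiles and
    linearly project each flattened tile. Returns (num_patches, dim)."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    tiles = (
        image[: rows * patch, : cols * patch]
        .reshape(rows, patch, cols, patch)
        .transpose(0, 2, 1, 3)
        .reshape(rows * cols, patch * patch)
    )
    return tiles @ proj


def ngram_embed(tokens, n):
    """Assumed definition: average every run of n consecutive patch tokens
    (stride 1). Returns (N - n + 1, dim). Tokens are in row-major order, so
    a window can wrap from the end of one patch row to the start of the next."""
    windows = np.lib.stride_tricks.sliding_window_view(tokens, n, axis=0)
    return windows.mean(axis=-1)  # windows has shape (N - n + 1, dim, n)


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def adaptive_ngram_embed(tokens, gate_logits):
    """Assumed form of 'adaptive': mix the 1..K-gram embeddings using softmax
    weights derived from gate_logits (length K, with K <= number of tokens)."""
    weights = softmax(np.asarray(gate_logits, dtype=float))
    num = tokens.shape[0]
    mixed = np.zeros(tokens.shape, dtype=float)
    for n, weight in enumerate(weights, start=1):
        grams = ngram_embed(tokens, n)
        # Repeat the last n-gram so every order yields N positions.
        pad = np.repeat(grams[-1:], num - grams.shape[0], axis=0)
        mixed += weight * np.concatenate([grams, pad], axis=0)
    return mixed


rng = np.random.default_rng(0)
image = rng.random((32, 128))           # stand-in for a cropped text image
proj = rng.normal(size=(8 * 8, 16))     # hypothetical projection to 16 dims
tokens = patch_embed(image, 8, proj)    # single-patch tokens (primary branch)
mixed = adaptive_ngram_embed(tokens, np.zeros(3))  # equal mix of 1-, 2-, 3-grams
```

In a trained model the gate logits would be learned parameters or predicted from the input; here they are fixed to zeros only to keep the example self-contained. Under these assumptions, `tokens` plays the role of the primary branch's input and `mixed` that of the supplementary branch's.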
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques
