TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision

Yukun Zhai; Xiaoqiang Zhang; Xiameng Qin; Sanyuan Zhao; Xingping Dong; Jianbing Shen

arXiv:2306.03377 · cs.CV · April 2, 2024 · 1 citation

TL;DR

TextFormer introduces a Transformer-based, query-driven end-to-end framework for text detection and recognition that leverages mixed supervision and a novel global aggregation module, achieving superior performance on multilingual benchmarks.

Contribution

The paper presents a novel query-based Transformer architecture with an Adaptive Global Aggregation module and mixed supervision, enhancing end-to-end text spotting performance.

Findings

01 Outperforms the prior state-of-the-art method on the TDA-ReCTS dataset by 13.2% in 1-NED

02 Integrates detection and recognition through shared features

03 Uses mixed supervision to improve detection and recognition accuracy
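
The 1-NED figure quoted in the first finding is one minus the normalized edit distance between predicted and reference transcriptions. As a rough illustration only, the sketch below assumes the common definition in which each pair's Levenshtein distance is divided by the length of the longer string and the result is averaged over all pairs. The function names `edit_distance` and `one_minus_ned` are hypothetical, and the step that matches detected text regions to ground-truth instances in an end-to-end evaluation is omitted; this is not the paper's evaluation code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def one_minus_ned(predictions, references):
    """1-NED over already-paired transcriptions (assumed definition, see lead-in)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one pair is required")
    total = 0.0
    for pred, ref in zip(predictions, references):
        longest = max(len(pred), len(ref))
        if longest:  # two empty strings contribute a distance of zero
            total += edit_distance(pred, ref) / longest
    return 1.0 - total / len(references)
```

Under this definition a score of 1.0 means every transcription matches its reference exactly, and a higher score is better.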

Abstract

End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods heavily rely on Region-of-Interest (RoI) operations to extract local features and complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. Specifically, using query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an Adaptive Global aGgregation (AGG) module to transfer global features into…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Linear Layer · Dropout · Label Smoothing · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization