DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao

TL;DR
DeepSolo++ introduces a unified Transformer-based framework for multilingual end-to-end text spotting that simultaneously handles detection, recognition, and script identification with high efficiency and extensibility.
Contribution
It proposes a single-decoder model with explicit points for integrated multilingual text detection, recognition, and script ID, improving performance and training simplicity.
Findings
Effective in English and Chinese scenes.
Handles complex fonts and large character classes.
Outperforms previous methods in script identification accuracy.
Abstract
End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and from low training efficiency. They also overlook multilingual text spotting, which requires an extra script identification task. In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have…
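The ordered-points representation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a text instance's center line is available as a polyline and samples a fixed number of points along it, evenly spaced by arc length, so that the points follow the reading order. The function name `sample_ordered_points`, the polyline input, and the even-spacing rule are illustrative assumptions, not the authors' implementation; the learnable point queries and the decoder are not modeled here.

```python
import math


def sample_ordered_points(center_line, num_points):
    """Sample num_points along a polyline, evenly spaced by arc length.

    center_line: list of (x, y) vertices in reading order (hypothetical input).
    Returns a list of (x, y) tuples ordered from the first vertex to the last.
    """
    if num_points < 2 or len(center_line) < 2:
        raise ValueError("need at least two vertices and two sample points")

    # Cumulative arc length at each vertex of the polyline.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(center_line, center_line[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]

    points = []
    segment = 0
    for i in range(num_points):
        target = total * i / (num_points - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(center_line) - 2 and cumulative[segment + 1] < target:
            segment += 1
        seg_len = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if seg_len == 0 else (target - cumulative[segment]) / seg_len
        (x0, y0), (x1, y1) = center_line[segment], center_line[segment + 1]
        # Linear interpolation inside the segment.
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


# Example: an L-shaped center line sampled at five ordered points.
ordered = sample_ordered_points([(0, 0), (4, 0), (4, 4)], 5)
```

In a point-query design, each sampled position would act as a reference for one query, so the order of the points carries the order of the characters.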
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: Attention Is All You Need · Dropout · Residual Connection · Linear Layer · Layer Normalization · Byte Pair Encoding · Softmax · Label Smoothing · Absolute Position Encodings · Multi-Head Attention
