DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting

Maoyuan Ye; Jing Zhang; Shanshan Zhao; Juhua Liu; Tongliang Liu; Bo Du; Dacheng Tao

arXiv:2305.19957 · cs.CV · March 19, 2024 · 1 citation

TL;DR

DeepSolo++ introduces a unified Transformer-based framework for multilingual end-to-end text spotting that simultaneously handles detection, recognition, and script identification with high efficiency and extensibility.

Contribution

It proposes a single-decoder model with explicit points for integrated multilingual text detection, recognition, and script ID, improving performance and training simplicity.

Findings

01. Effective in English and Chinese scenes.

02. Handles complex fonts and large character classes.

03. Outperforms previous methods in script identification accuracy.

Abstract

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and from low training efficiency. In addition, they overlook multilingual text spotting, which requires an extra script identification task. In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have…
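To make the "character sequence as ordered points" idea concrete, here is a minimal sketch. It is not the authors' implementation: the abstract excerpt above does not say how the points are obtained, so the sketch assumes a text instance is given as a polyline center line and samples a fixed number of points evenly by arc length. The function name `sample_ordered_points` and its parameters are hypothetical.

```python
import math


def sample_ordered_points(center_line, num_points):
    """Return num_points (x, y) tuples evenly spaced by arc length.

    Assumption (not from the paper): the text instance is described by a
    polyline center line, given as a list of (x, y) vertices in reading order.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")

    # Keep only segments with non-zero length so the division below is safe.
    segments = [(a, b) for a, b in zip(center_line, center_line[1:]) if a != b]
    if not segments:
        raise ValueError("center_line needs at least two distinct vertices")

    lengths = [math.dist(a, b) for a, b in segments]
    total = sum(lengths)

    points = []
    for i in range(num_points):
        # Arc-length position of the i-th point, from 0 to total.
        target = total * i / (num_points - 1)
        walked = 0.0
        for idx, ((ax, ay), (bx, by)) in enumerate(segments):
            length = lengths[idx]
            is_last = idx == len(segments) - 1
            if target <= walked + length or is_last:
                # Interpolate inside this segment; clamp against rounding error.
                t = min((target - walked) / length, 1.0)
                points.append((ax + t * (bx - ax), ay + t * (by - ay)))
                break
            walked += length
    return points


# Example: an L-shaped center line sampled at five ordered points.
ordered = sample_ordered_points([(0, 0), (4, 0), (4, 4)], 5)
```

Because the points follow the reading order of the line, the i-th point can be paired with the i-th position of the character sequence, which is the property the abstract relies on when it speaks of ordered points.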

Code & Models

Repositories

vitae-transformer/deepsolo
PyTorch · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Dropout · Residual Connection · Linear Layer · Layer Normalization · Byte Pair Encoding · Softmax · Label Smoothing · Absolute Position Encodings · Multi-Head Attention