End-to-End Video Text Spotting with Transformer

Weijia Wu; Yuanqiang Cai; Chunhua Shen; Debing Zhang; Ying Fu; Hong Zhou; Ping Luo

arXiv:2203.10539 · cs.CV · August 23, 2022 · 6 cites

TL;DR

This paper introduces TransDETR, an end-to-end Transformer-based framework for video text spotting that simultaneously detects, tracks, and recognizes text across multiple frames, achieving state-of-the-art results.

Contribution

TransDETR is the first end-to-end trainable video text spotting framework that implicitly tracks and recognizes text using long-range temporal queries.

Findings

1. Achieves up to 8.0% improvement on video text spotting benchmarks.
2. Outperforms existing methods on four video text datasets.
3. Demonstrates effective long-range temporal modeling for text tracking.

Abstract

Recent video text spotting methods usually require a three-stage pipeline: detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate final results. These methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR has two main advantages: 1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes each text instance implicitly through a distinct query, termed a text query, over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (e.g., text…
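For readers unfamiliar with the tracking-by-match paradigm mentioned in the abstract, the following is a minimal sketch of explicit matching between adjacent frames. It is a generic greedy IoU matcher written for illustration, not code from TransDETR or from any method the paper compares against. The function names (`iou`, `match_adjacent`), the axis-aligned box format, and the 0.5 threshold are all assumptions.

```python
from typing import Dict, List, Tuple

# Assumed box format for this sketch: axis-aligned (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_adjacent(
    prev: Dict[int, Box],
    curr: List[Box],
    next_id: int,
    thresh: float = 0.5,  # hypothetical threshold, chosen for illustration
) -> Tuple[Dict[int, Box], int]:
    """Greedily assign current-frame boxes to previous-frame track IDs.

    Pairs are visited from highest IoU to lowest; each track ID and each
    current box is used at most once. Unmatched boxes start new tracks.
    """
    pairs = sorted(
        (
            (iou(pb, cb), tid, ci)
            for tid, pb in prev.items()
            for ci, cb in enumerate(curr)
        ),
        reverse=True,
    )
    assigned: Dict[int, Box] = {}
    used_curr = set()
    for score, tid, ci in pairs:
        if score < thresh:
            break  # remaining pairs overlap even less
        if tid in assigned or ci in used_curr:
            continue
        assigned[tid] = curr[ci]
        used_curr.add(ci)
    for ci, cb in enumerate(curr):
        if ci not in used_curr:
            assigned[next_id] = cb
            next_id += 1
    return assigned, next_id


# Two text boxes shift slightly between frames and keep their IDs.
tracks, nid = match_adjacent({}, [(0, 0, 10, 10), (20, 20, 30, 30)], 0)
tracks, nid = match_adjacent(tracks, [(21, 20, 31, 30), (1, 0, 11, 10)], nid)
print(tracks)
```

In this sketch the matcher only looks one frame back, so a text instance that goes undetected for a single frame comes back under a new ID. That is the kind of limitation the abstract contrasts with text queries carried over a long-range temporal sequence.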

Code & Models

Repositories

weijiawu/transdetr (PyTorch, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Digital Media Forensic Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout