VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Yuliang Liu; Mingxin Huang; Hao Yan; Linger Deng; Weijia Wu; Hao Lu; Chunhua Shen; Lianwen Jin; Xiang Bai

arXiv:2404.19652·cs.CV·December 6, 2024

TL;DR

VimTS is a novel multi-task model that significantly improves cross-domain text spotting in images and videos by leveraging task-specific modules and synthetic datasets, outperforming state-of-the-art methods with fewer parameters.

Contribution

The paper introduces VimTS, a unified framework with prompt query generation and task-aware adapters, enhancing cross-domain generalization for text spotting in images and videos.

Findings

01 Outperforms state-of-the-art methods by 2.6% on six benchmarks.

02 Surpasses previous video spotting methods by 5.5% MOTA.

03 Uses the synthetic dataset VTD-368k to learn temporal information efficiently.
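
For reference, MOTA (Multiple Object Tracking Accuracy), the metric in the second finding, is conventionally defined as below, following the CLEAR MOT metrics. The symbols are the standard ones and are not taken from this page: FN_t, FP_t, and IDSW_t count the missed detections, false positives, and identity switches in frame t, and GT_t is the number of ground-truth objects (here, text instances) in that frame. How the paper averages the score across datasets is not stated here.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_{t} \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_{t} \mathrm{GT}_t}
```

Higher is better, and the value can be negative when the number of errors exceeds the number of ground-truth objects.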

Abstract

Text spotting, a task involving the extraction of textual information from images or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a…
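
To make the adapter idea in the abstract more concrete, here is a minimal sketch of a generic bottleneck adapter with a residual connection, written in dependency-free Python. It is not the paper's Tasks-aware Adapter, whose design is not described on this page. The class name `BottleneckAdapter`, the dimensions, and the task names are all hypothetical. It illustrates only the general pattern: a small trainable module added per task that leaves the input features unchanged at initialization.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vector]


class BottleneckAdapter:
    """Generic bottleneck adapter (hypothetical, not the VimTS design).

    Features are projected down to a small bottleneck, passed through a
    nonlinearity, projected back up, and added to the input.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights, shape (bottleneck, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: zeros, shape (dim, bottleneck), so the adapter
        # starts out as an identity mapping.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, features):
        hidden = relu(matvec(self.down, features))
        update = matvec(self.up, hidden)
        # Residual connection: input features plus the learned update.
        return [f + u for f, u in zip(features, update)]

    def num_parameters(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


# One adapter per task; the task names are illustrative assumptions.
adapters = {
    "detection": BottleneckAdapter(dim=8, bottleneck=2, seed=1),
    "tracking": BottleneckAdapter(dim=8, bottleneck=2, seed=2),
}
```

With `dim=8` and `bottleneck=2`, each adapter has 32 weights, compared with 64 for a full 8-by-8 linear layer. This is the kind of saving that "minimal additional parameters" usually refers to in adapter-based methods.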

Code & Models

Repositories

Yuliang-Liu/VimTS (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Video Analysis and Summarization

Methods: Adapter