VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

TL;DR
VimTS is a novel multi-task model that significantly improves cross-domain text spotting in images and videos by leveraging task-specific modules and synthetic datasets, outperforming state-of-the-art methods with fewer parameters.
Contribution
The paper introduces VimTS, a unified framework with prompt query generation and task-aware adapters, enhancing cross-domain generalization for text spotting in images and videos.
Findings
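Outperforms the state-of-the-art method by an average of 2.6% across six cross-domain benchmarks.
Surpasses the previous video text spotting method by an average of 5.5% on the MOTA metric.
Introduces the synthetic video text dataset VTD-368k so the model can learn temporal information at a lower cost.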
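For readers unfamiliar with the MOTA metric cited above, here is a minimal sketch of the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, where the counts are summed over all frames. The function name and the example counts are hypothetical and are not taken from the paper; the exact matching protocol used by each video benchmark is assumed, not reproduced here.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy (standard CLEAR MOT definition).

    All arguments are counts summed over every frame of the sequence.
    The result can be negative when errors outnumber ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Hypothetical counts, for illustration only.
baseline = mota(false_negatives=30, false_positives=20, id_switches=10,
                num_ground_truth=100)
improved = mota(false_negatives=25, false_positives=15, id_switches=5,
                num_ground_truth=100)
print(f"baseline MOTA = {baseline:.2f}, improved MOTA = {improved:.2f}")
```

Because MOTA folds misses, false alarms, and identity switches into one number, a gain can come from better detection, better tracking, or both.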
Abstract
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a…
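To illustrate why an adapter adds only a small number of parameters, here is a minimal sketch of a generic residual bottleneck adapter in plain Python. This is not the paper's Tasks-aware Adapter, whose design is not described in the text above; the class name, dimensions, and initialization are assumptions chosen for illustration.

```python
import random


class BottleneckAdapter:
    """Generic residual bottleneck adapter: y = x + W_up * relu(W_down * x).

    Illustrative only; NOT the Tasks-aware Adapter from the paper.
    Biases are omitted to keep the sketch short.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection (bottleneck x dim), small random initialization.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck), zero-initialized so the adapter
        # starts as an identity mapping and leaves the frozen model unchanged.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def num_parameters(self):
        return 2 * self.dim * self.bottleneck

    def __call__(self, x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                  for row in self.down]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.up]
        return [xi + di for xi, di in zip(x, delta)]


adapter = BottleneckAdapter(dim=256, bottleneck=16)
features = [1.0] * 256
output = adapter(features)
# A full dim x dim layer would need 256 * 256 weights; the adapter needs far fewer.
print(adapter.num_parameters(), 256 * 256)
```

The bottleneck width controls the trade-off: a narrower bottleneck adds fewer trainable weights per layer but gives each task less capacity to reshape the shared features.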
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Video Analysis and Summarization
Methods: Adapter
