All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, Yanfeng Wang

TL;DR
This paper introduces an All-in-One vision-language tracking framework using a unified transformer backbone that integrates feature extraction and interaction, improving efficiency and performance in complex scenarios.
Contribution
It proposes a unified transformer-based architecture for VL tracking that combines feature extraction and fusion, along with a multi-modal alignment module for better representations.
Findings
Outperforms state-of-the-art trackers on five benchmarks
Simplifies architecture by removing separate fusion modules
Enhances target-aware capability in complex scenarios
Abstract
The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized, heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with a unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix…
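The idea of joint feature extraction and interaction can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes, for illustration only, that language and visual tokens are concatenated into one sequence and passed through a single shared self-attention layer, so that visual features are computed with language context from the start. The token counts, embedding size, random weights, and the helper name joint_self_attention are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over one mixed token sequence.

    Hypothetical helper: every token (language or visual) attends to
    every other token, so no separate fusion module is needed.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
dim = 16               # assumed embedding size
n_lang, n_vis = 4, 9   # assumed token counts (e.g. 4 words, 3x3 patches)

lang_tokens = rng.normal(size=(n_lang, dim))
vis_tokens = rng.normal(size=(n_vis, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Mix both modalities into a single sequence for the shared layer.
mixed = np.concatenate([lang_tokens, vis_tokens], axis=0)
out, attn = joint_self_attention(mixed, w_q, w_k, w_v)

# Visual outputs now carry language context; a tracking head would use these.
vis_out = out[n_lang:]

# Share of each visual token's attention that goes to language tokens.
vis_to_lang = attn[n_lang:, :n_lang].sum(axis=1)
print(vis_out.shape, vis_to_lang.round(3))
```

In a conventional three-part pipeline, the two modalities would be encoded separately and only meet in a later fusion module. In this sketch the interaction already happens inside the shared layer.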
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
