All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, and Yanfeng Wang

arXiv:2307.03373·cs.CV·March 3, 2025

Open Access

TL;DR

This paper introduces an All-in-One vision-language tracking framework using a unified transformer backbone that integrates feature extraction and interaction, improving efficiency and performance in complex scenarios.

Contribution

It proposes a unified transformer-based architecture for VL tracking that combines feature extraction and fusion, along with a multi-modal alignment module for better representations.

Findings

01

Outperforms state-of-the-art methods on five benchmarks

02

Simplifies architecture by removing separate fusion modules

03

Enhances target-aware capability in complex scenarios

Abstract

The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized and heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with unified architectures for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix…
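
To make the idea of joint feature extraction and interaction concrete, the sketch below shows one common way a single backbone can process both modalities: language and visual tokens are placed in one sequence and passed through self-attention, so every visual token can attend to the language tokens while it is being encoded. This is a toy illustration under stated assumptions, not the paper's implementation. The function names, the token values, the single attention head, and the identity query/key/value projections are all hypothetical and chosen only to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption for illustration: a real backbone would use learned
    projections, multiple heads, and many stacked layers.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        # Weighted sum of all tokens (values are the tokens themselves).
        fused = [sum(row[j] * tokens[j][d] for j in range(len(tokens)))
                 for d in range(dim)]
        outputs.append(fused)
        weights.append(row)
    return outputs, weights


# Hypothetical toy embeddings: 2 language tokens and 3 visual tokens, dim 4.
language_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_tokens = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# One sequence, one attention pass: extraction and interaction happen together.
joint_sequence = language_tokens + visual_tokens
fused_tokens, attention = joint_self_attention(joint_sequence)
```

In this toy setup, the first visual token is most similar to the first language token, so its attention row places more weight there than on the second language token. In a three-part pipeline, by contrast, the visual encoder would finish before any language signal was available to it.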

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques