Divert More Attention to Vision-Language Object Tracking
Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, Heng Fan

TL;DR
This paper introduces a large-scale vision-language tracking database and a novel framework that enhances tracking performance by learning unified adaptive VL representations, significantly improving multiple baseline methods across six benchmarks.
Contribution
The paper builds a large-scale vision-language tracking database of more than 23,000 videos by annotating six popular tracking benchmarks with language attributes, and proposes a new VL representation learning framework with asymmetric architecture search and modality mixing.
Findings
The learned VL representations significantly improve multiple baseline trackers across six benchmarks.
Effective VL representation enhances various tracking models.
Theoretical analysis supports the approach's rationality.
Abstract
Multimodal vision-language (VL) learning has noticeably pushed the tendency toward generic intelligence owing to emerging large foundation models. However, tracking, as a fundamental vision problem, surprisingly enjoys less bonus from recent flourishing VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning of current works. These nuisances motivate us to design more effective vision-language representation for tracking, meanwhile constructing a large database with language annotation for model learning. Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to…
Taxonomy
Topics: Media, Religion, Digital Communication · Text and Document Classification Technologies · Advanced Image and Video Retrieval Techniques
Methods: ALIGN
