Divert More Attention to Vision-Language Object Tracking

Mingzhe Guo; Zhipeng Zhang; Liping Jing; Haibin Ling; Heng Fan

arXiv:2307.10046 · cs.CV · July 20, 2023

TL;DR

This paper introduces a large-scale vision-language tracking database and a novel framework that enhances tracking performance by learning unified adaptive VL representations, significantly improving multiple baseline methods across six benchmarks.

Contribution

The paper builds a large-scale vision-language tracking database (more than 23,000 videos, annotated across six popular tracking benchmarks) and proposes a new VL representation learning framework with asymmetric architecture search and modality mixing.

Findings

01. Significant performance improvements on six benchmarks.

02. Effective VL representation enhances various tracking models.

03. Theoretical analysis supports the approach's rationality.

Abstract

Multimodal vision-language (VL) learning has noticeably pushed the tendency toward generic intelligence owing to emerging large foundation models. However, tracking, as a fundamental vision problem, surprisingly enjoys less bonus from recent flourishing VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning of current works. These nuisances motivate us to design more effective vision-language representation for tracking, meanwhile constructing a large database with language annotation for model learning. Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to…

Code & Models

Repositories

JudasDie/SOTS (PyTorch, official)

Taxonomy

Topics: Media, Religion, Digital Communication · Text and Document Classification Technologies · Advanced Image and Video Retrieval Techniques

Methods: ALIGN