ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking

X. Feng; S. Hu; X. Li; D. Zhang; M. Wu; J. Zhang; X. Chen; K. Huang

arXiv:2507.19875·cs.CV·July 29, 2025

Open Access

TL;DR

ATCTrack introduces a novel multimodal tracking approach that aligns dynamic target states with visual and textual cues, significantly improving robustness in complex long-term vision-language tracking scenarios.

Contribution

The paper proposes a new tracker that models target-context features for better alignment with dynamic target states, incorporating temporal visual modeling and textual context calibration.

Findings

01. Achieves new state-of-the-art performance on mainstream benchmarks.

02. Effective temporal visual target-context modeling enhances tracking robustness.

03. Identifying textual target words improves cue utilization.

Abstract

Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Human Pose and Action Recognition