ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, K. Huang

TL;DR
ATCTrack introduces a novel multimodal tracking approach that aligns dynamic target states with visual and textual cues, significantly improving robustness in complex long-term vision-language tracking scenarios.
Contribution
The paper proposes a new tracker that models target-context features for better alignment with dynamic target states, incorporating temporal visual modeling and textual context calibration.
Findings
Achieves new state-of-the-art performance on mainstream benchmarks.
Effective temporal visual target-context modeling enhances tracking robustness.
Textual target words identification improves cue utilization.
Abstract
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Human Pose and Action Recognition
