Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang

TL;DR
This paper introduces CTVLT, a novel method that converts textual cues into visual heatmaps to improve vision-language tracking, achieving state-of-the-art results by better aligning text and image modalities.
Contribution
The paper proposes a plug-and-play approach that transforms textual cues into visual heatmaps, enhancing the alignment between text and visual data in tracking tasks.
Findings
Achieves state-of-the-art performance on mainstream benchmarks.
Effectively converts textual cues into visual heatmaps for better tracking.
Demonstrates improved alignment between text and image modalities.
Abstract
Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive…
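The two modules named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a grounding model has already returned a bounding box for the described target, renders that box as a Gaussian heatmap, and fuses it with the search image by simple re-weighting. The function names (`box_to_heatmap`, `fuse_heatmap`) and the parameters `sigma_scale` and `alpha` are hypothetical and chosen only for illustration.

```python
import numpy as np

def box_to_heatmap(box, height, width, sigma_scale=0.25):
    """Render a box (x0, y0, x1, y1) as a 2-D Gaussian heatmap with values in (0, 1].

    Stand-in for a textual cue mapping step: the box is assumed to come
    from some grounding model (hypothetical here, not part of this sketch).
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Tie the spread to the box size; sigma_scale is an illustrative choice.
    sx = max((x1 - x0) * sigma_scale, 1.0)
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                    + ((ys - cy) ** 2) / (2 * sy ** 2)))

def fuse_heatmap(search_image, heatmap, alpha=0.5):
    """Down-weight pixels far from the text-indicated region.

    Stand-in for a heatmap guidance step: a fixed blend, where a real
    tracker would more likely fuse learned features.
    """
    weight = (1.0 - alpha) + alpha * heatmap  # values in (1 - alpha, 1]
    return search_image * weight[..., None]   # broadcast over channels

# Usage: a 100x100 RGB search image and a box assumed to match the text.
search = np.ones((100, 100, 3), dtype=np.float64)
heat = box_to_heatmap((20, 30, 60, 70), height=100, width=100)
guided = fuse_heatmap(search, heat, alpha=0.5)
```

The multiplicative blend keeps the fused input in the same pixel space as the search image, which is the property the abstract highlights: the text is handed to the tracker in a visual form it can already process.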
Taxonomy
Topics: Language, Metaphor, and Cognition
Methods: Heatmap · ALIGN
