Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues

X. Feng; D. Zhang; S. Hu; X. Li; M. Wu; J. Zhang; X. Chen; K. Huang

arXiv:2412.19648·cs.CV·December 30, 2024

TL;DR

This paper introduces CTVLT, a novel method that converts textual cues into visual heatmaps to improve vision-language tracking, achieving state-of-the-art results by better aligning text and image modalities.

Contribution

The paper proposes a plug-and-play approach that transforms textual cues into visual heatmaps, enhancing the alignment between text and visual data in tracking tasks.

Findings

01 Achieves state-of-the-art performance on mainstream benchmarks.

02 Effectively converts textual cues into visual heatmaps for better tracking.

03 Demonstrates improved alignment between text and image modalities.

Abstract

Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive…
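The two modules named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a hypothetical grounding model has already turned the text into a bounding box, approximates the target distribution heatmap with a 2D Gaussian over that box, and stands in for the heatmap guidance module with a simple multiplicative reweighting of a single-channel search feature map. The function names `box_to_heatmap` and `fuse`, and the parameters `sigma_scale` and `alpha`, are illustrative assumptions that do not appear in the paper.

```python
import math


def box_to_heatmap(box, width, height, sigma_scale=0.25):
    """Render a Gaussian heatmap peaked at the center of a box.

    `box` is (x0, y0, x1, y1) in pixel coordinates and is ASSUMED to come
    from some text-grounding model; the paper's actual mapping may differ.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Spread scales with box size; the floor avoids division by zero.
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    return [
        [
            math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def fuse(search_features, heatmap, alpha=1.0):
    """Reweight each location as f * (1 + alpha * h).

    A hypothetical stand-in for heatmap guidance: locations the text
    points to are amplified, all others are left nearly unchanged.
    """
    return [
        [f * (1.0 + alpha * h) for f, h in zip(f_row, h_row)]
        for f_row, h_row in zip(search_features, heatmap)
    ]


# Toy example: an 8x8 search region with uniform features and a
# text-derived box covering its center.
heatmap = box_to_heatmap((2, 2, 6, 6), width=8, height=8)
features = [[1.0] * 8 for _ in range(8)]
guided = fuse(features, heatmap)
```

In this sketch the guided feature at the box center is doubled while the corners stay close to their original value, which is the intuition behind letting a tracker consume the text as a spatial prior rather than as raw language tokens.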

Code & Models

Repositories

xiaokunfeng/ctvlt (PyTorch, official)

Taxonomy

Topics: Language, Metaphor, and Cognition

Methods: Heatmap · ALIGN