CLDTracker: A Comprehensive Language Description for Visual Tracking

Mohamad Alansari, Sajid Javed, Iyyakutti Iyappan Ganapathi, Sara Alansari, and Muzammal Naseer

arXiv:2505.23704 · cs.CV · May 30, 2025

TL;DR

CLDTracker introduces a dual-branch framework that combines rich, temporally adaptive language descriptions with visual features, and it reports state-of-the-art results in visual object tracking by integrating semantic and visual cues.

Contribution

The paper presents a dual-branch architecture that uses vision-language models (VLMs) to construct comprehensive textual descriptions of the target, which improves visual tracking performance.

Findings

01. Achieves state-of-the-art (SOTA) performance on six visual object tracking (VOT) benchmarks.

02. Integrates semantic and visual features for robust tracking.

03. Demonstrates the benefits of temporally adaptive language representations.

Abstract

Visual object tracking (VOT) remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, which rely primarily on visual cues, often struggle in such complex scenarios. Recent advances in vision-language models (VLMs) have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain,…

Code & Models

Repositories

hamadya/cldtracker (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training
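
For readers unfamiliar with the method tagged above, the following is a minimal sketch of how contrastively trained image and text embeddings are typically compared: cosine similarity, scaled by a temperature and passed through a softmax. It is an illustration only, not the CLDTracker implementation. The embeddings are hand-written toy vectors rather than encoder outputs, and the function names and the temperature value of 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def match_probabilities(image_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled image-to-text similarities.

    The temperature is an illustrative assumption, not a value taken
    from the paper.
    """
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical): one image crop and two candidate descriptions.
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [1.0, 0.0, 0.0],  # description that points in nearly the same direction
    [0.0, 1.0, 0.0],  # description that is nearly orthogonal to the crop
]
probs = match_probabilities(image_emb, text_embs)
```

In this toy setup the first description receives the higher probability, because its embedding is closer in direction to the image embedding. A real pipeline would obtain both embeddings from pretrained image and text encoders.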