CLDTracker: A Comprehensive Language Description for Visual Tracking
Mohamad Alansari, Sajid Javed, Iyyakutti Iyappan Ganapathi, Sara Alansari, and Muzammal Naseer

TL;DR
CLDTracker is a dual-branch framework that combines rich, temporally adaptive language descriptions with visual features. By integrating semantic and visual cues, it achieves state-of-the-art results in visual object tracking.
Contribution
The paper presents a novel dual-branch architecture that uses vision-language models (VLMs) to construct comprehensive textual descriptions of the target, improving visual tracking performance.
Findings
Achieves state-of-the-art (SOTA) performance on six visual object tracking (VOT) benchmarks.
Effectively integrates semantic and visual features for robust tracking.
Demonstrates the benefits of temporally-adaptive language representations.
Abstract
Visual object tracking (VOT) remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in vision-language models (VLMs) have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
