How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking
Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang

TL;DR
This paper introduces VLTVerse, a comprehensive evaluation framework for vision-language tracking that analyzes the role of semantic information across various challenge factors, revealing performance bottlenecks and guiding future improvements.
Contribution
VLTVerse is the first fine-grained evaluation framework for VLT, considering multiple challenge factors and semantic types to systematically analyze tracker performance.
Findings
Uncovered performance bottlenecks of SOTA VLT trackers in complex scenarios.
Provided insights into how different semantic types impact tracking under various challenges.
Offered guidance for improving VLT algorithms based on fine-grained analysis.
Abstract
Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a "distraction." To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by…
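The evaluation space described in the abstract can be pictured as the cross product of the challenge labels and the semantic types, which gives 10 × 6 = 60 subspaces. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract gives only the counts, so the label names (`challenge_01`, `semantic_1`, and so on), the record layout, and the helpers `build_subspaces` and `aggregate_scores` are all hypothetical.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Placeholder names (assumption): the abstract states only that there are
# 10 challenge labels and 6 semantic types, not what they are called.
CHALLENGES = [f"challenge_{i:02d}" for i in range(1, 11)]
SEMANTICS = [f"semantic_{i}" for i in range(1, 7)]


def build_subspaces(challenges, semantics):
    """Return every (challenge, semantic) pair as one evaluation subspace."""
    return list(product(challenges, semantics))


def aggregate_scores(records):
    """Average per-sequence scores within each subspace.

    `records` is an iterable of (challenge, semantic, score) tuples; this
    layout is an illustrative assumption, not a format from the paper.
    """
    buckets = defaultdict(list)
    for challenge, semantic, score in records:
        buckets[(challenge, semantic)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


subspaces = build_subspaces(CHALLENGES, SEMANTICS)
print(len(subspaces))  # 60
```

Grouping scores by subspace in this way is what allows a tracker to be compared per challenge and per semantic type rather than by a single benchmark-wide number.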
Taxonomy
Topics: Gaze Tracking and Assistive Technology
