How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking

Xuchen Li; Shiyu Hu; Xiaokun Feng; Dailing Zhang; Meiqi Wu; Jing Zhang; Kaiqi Huang

arXiv:2411.15600 · cs.CV · November 26, 2024


TL;DR

This paper introduces VLTVerse, a comprehensive evaluation framework for vision-language tracking that analyzes the role of semantic information across various challenge factors, revealing performance bottlenecks and guiding future improvements.

Contribution

VLTVerse is the first fine-grained evaluation framework for VLT, considering multiple challenge factors and semantic types to systematically analyze tracker performance.

Findings

01. Uncovered performance bottlenecks of SOTA VLT trackers in complex scenarios.

02. Provided insights into how different semantic types impact tracking under various challenges.

03. Offered guidance for improving VLT algorithms based on fine-grained analysis.

Abstract

Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a "distraction." To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by…
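The 60 subspaces mentioned above are consistent with pairing each of the 10 challenge labels with each of the 6 semantic types (10 × 6 = 60). The following is a minimal sketch of that pairing and of a per-subspace average, not the authors' implementation. The label names (`challenge_0`, `semantic_0`, and so on), the function names, and the record layout are all hypothetical placeholders, since the abstract gives only the counts.

```python
from collections import defaultdict
from itertools import product

# Hypothetical placeholder names: the source states only the counts (10 and 6),
# not the actual challenge labels or semantic types.
CHALLENGES = [f"challenge_{i}" for i in range(10)]
SEMANTICS = [f"semantic_{j}" for j in range(6)]


def build_subspaces(challenges, semantics):
    """Return every (challenge, semantic type) pair as one evaluation subspace."""
    return list(product(challenges, semantics))


def mean_score_per_subspace(records):
    """Average a per-sequence score inside each subspace.

    `records` is an iterable of (challenge, semantic, score) tuples. This layout
    is an assumption made for illustration; a sequence carrying several
    challenge labels would simply contribute one record per label.
    """
    buckets = defaultdict(list)
    for challenge, semantic, score in records:
        buckets[(challenge, semantic)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


subspaces = build_subspaces(CHALLENGES, SEMANTICS)
```

Grouping scores by the pair rather than by either axis alone is what makes the analysis fine-grained: the same tracker can be compared across semantic types while the challenge factor is held fixed, and vice versa.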


Taxonomy

Topics: Gaze Tracking and Assistive Technology