Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark

Xuchen Li; Shiyu Hu; Xiaokun Feng; Dailing Zhang; Meiqi Wu; Jing Zhang; Kaiqi Huang

arXiv:2409.08887·cs.CV·September 16, 2024

Open Access

TL;DR

This paper introduces VLT-MI, a new benchmark for visual language tracking that incorporates multi-round multi-modal interactions, enabling more realistic and robust human-machine tracking scenarios.

Contribution

The paper proposes a new benchmark and interaction paradigm for visual language tracking (VLT) that adds multi-round interaction and text updates during tracking, aiming to improve tracker robustness and evaluation.

Findings

01. Enhanced tracker robustness through multi-round interaction
02. Generation of diverse multi-granularity texts using LLMs
03. Improved evaluation of multi-modal tracking accuracy

Abstract

Visual Language Tracking (VLT) enhances tracking by mitigating the limitations of relying solely on the visual modality, utilizing high-level semantic information through language. Integrating language in this way enables more advanced human-machine interaction. The essence of interaction is cognitive alignment, which typically requires multiple information exchanges, especially in the sequential decision-making process of VLT. However, current VLT benchmarks do not account for multi-round interactions during tracking. They provide only an initial text and bounding box (bbox) in the first frame, with no further interaction as tracking progresses, which deviates from the original motivation of the VLT task. To address these limitations, we propose a novel and robust benchmark, VLT-MI (Visual Language Tracking with Multi-modal Interaction), which introduces multi-round interaction into the VLT…

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media