Visual Language Tracking with Multi-modal Interaction: A Robust Benchmark
Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang

TL;DR
This paper introduces VLT-MI, a new benchmark for visual language tracking that incorporates multi-round multi-modal interactions, enabling more realistic and robust human-machine tracking scenarios.
Contribution
The paper proposes a novel benchmark and interaction paradigm for VLT, integrating multi-round interactions and text updates to improve robustness and evaluation.
Findings
Enhanced tracker robustness through multi-round interaction
Generation of diverse multi-granularity texts using LLMs
Improved evaluation of multi-modal tracking accuracy
Abstract
Visual Language Tracking (VLT) enhances tracking by mitigating the limitations of relying solely on the visual modality, utilizing high-level semantic information through language. This integration of language enables more advanced human-machine interaction. The essence of interaction is cognitive alignment, which typically requires multiple information exchanges, especially in the sequential decision-making process of VLT. However, current VLT benchmarks do not account for multi-round interactions during tracking. They provide only an initial text and bounding box (bbox) in the first frame, with no further interaction as tracking progresses, deviating from the original motivation of the VLT task. To address these limitations, we propose a novel and robust benchmark, VLT-MI (Visual Language Tracking with Multi-modal Interaction), which introduces multi-round interaction into the VLT…
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
