Context-Aware Integration of Language and Visual References for Natural Language Tracking
Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen

TL;DR
This paper introduces a joint multi-modal tracking framework that combines language and visual cues through prompt modulation and unified decoding, improving accuracy and reducing drift in natural language video tracking.
Contribution
The paper presents a novel end-to-end multi-modal tracking method that effectively integrates language and visual information to enhance tracking accuracy and robustness.
Findings
Achieves competitive results on TNL2K, OTB-Lang, LaSOT, and RefCOCOg datasets.
Outperforms existing methods in tracking and grounding tasks.
Demonstrates robustness to language-visual misalignments.
Abstract
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methodologies perform language-based and template-based matching for target reasoning separately and then merge the matching results from the two sources. They suffer from tracking drift when the language and visual templates misalign with the dynamic target state, and from ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module to leverage the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues, and 2) a unified target decoding module to integrate the multi-modal reference cues and execute the integrated queries on the search image to predict the target…
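The two stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the code below assumes single-head cross-attention for the modulation step and a mean-pooled query for the decoding step. All function names (cross_attend, modulate_prompts, decode_target), token counts, and the feature dimension are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention with a residual connection.
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return queries + weights @ keys_values

def modulate_prompts(lang_tokens, template_tokens):
    # ASSUMPTION: each reference modality is refined by attending to the other.
    lang_mod = cross_attend(lang_tokens, template_tokens)
    tmpl_mod = cross_attend(template_tokens, lang_tokens)
    return lang_mod, tmpl_mod

def decode_target(lang_mod, tmpl_mod, search_tokens):
    # ASSUMPTION: both references are pooled into one query that scores
    # every search-image token; the real decoder is likely far richer.
    reference = np.concatenate([lang_mod, tmpl_mod], axis=0)
    query = reference.mean(axis=0, keepdims=True)           # shape (1, d)
    d = query.shape[-1]
    scores = softmax(query @ search_tokens.T / np.sqrt(d))  # shape (1, n_search)
    return scores[0]

# Random stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
d = 16
lang = rng.normal(size=(5, d))     # 5 language tokens
tmpl = rng.normal(size=(9, d))     # 9 visual template tokens
search = rng.normal(size=(36, d))  # 6x6 grid of search-image tokens

lang_mod, tmpl_mod = modulate_prompts(lang, tmpl)
response = decode_target(lang_mod, tmpl_mod, search)
best = int(response.argmax())      # index of the highest-scoring search token
```

The point of the sketch is the data flow: both references are refined jointly before a single query is run on the search image, in contrast to matching each modality separately and merging the results afterwards.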
Taxonomy
Topics: Speech and dialogue systems · Semantic Web and Ontologies
