TL;DR
VPTracker introduces a global vision-language tracking framework leveraging multimodal large language models and spatial priors, significantly improving robustness and disambiguation in challenging scenarios.
Contribution
The authors present VPTracker as the first framework to use global search with MLLMs for vision-language tracking. It incorporates spatial priors from the target's previous location to improve localization accuracy and suppress distractors.
Findings
Significantly improves tracking stability under occlusions and viewpoint changes.
Reduces false positives from similar objects through spatial priors.
Demonstrates superior performance compared to local search methods.
Abstract
Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce VPTracker, the first global tracking framework based on Multimodal Large Language Models (MLLMs), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level…
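To make the idea of a region-level prompt built from the target's previous location more concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not say how the region is constructed, so the expansion rule, the `scale` parameter, and the names `Box`, `region_prompt`, and `in_region` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates, (x1, y1) top-left to (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def region_prompt(prev: Box, img_w: int, img_h: int, scale: float = 2.0) -> Box:
    """Expand the previous target box about its centre and clip it to the image.

    ASSUMPTION: `scale` is a hypothetical hyperparameter, not a value from the paper.
    """
    cx, cy = (prev.x1 + prev.x2) / 2, (prev.y1 + prev.y2) / 2
    half_w = (prev.x2 - prev.x1) * scale / 2
    half_h = (prev.y2 - prev.y1) * scale / 2
    return Box(
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )


def in_region(candidate: Box, region: Box) -> bool:
    """Return True if the candidate's centre lies inside the prompted region."""
    cx, cy = (candidate.x1 + candidate.x2) / 2, (candidate.y1 + candidate.y2) / 2
    return region.x1 <= cx <= region.x2 and region.y1 <= cy <= region.y2
```

Under these assumptions, a global search could still return candidates anywhere in the frame, while `in_region` gives a simple way to prefer candidates near the last known location. How the actual model weighs the region against the rest of the image is not stated in the abstract.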
