Improving Visual Object Tracking through Visual Prompting
Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

TL;DR
This paper introduces PiVOT, a novel visual prompting mechanism leveraging the CLIP foundation model to dynamically generate and refine prompts, significantly enhancing the discriminative ability of generic object trackers against distractors.
Contribution
The paper proposes a new visual prompting approach for object tracking that uses CLIP to automatically generate and refine prompts online, improving distractor suppression and tracking accuracy.
Findings
PiVOT improves tracking performance across multiple benchmarks.
The method effectively suppresses distractors during tracking.
Extensive experiments validate the effectiveness of the proposed prompting mechanism.
Abstract
Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Dynamic adaptation of target representation against distractors remains challenging because prevailing trackers exhibit limited discriminative capability. To address this issue, we present a new visual prompting mechanism for generic object tracking, termed PiVOT. PiVOT introduces mechanisms that leverage the pretrained foundation model (CLIP) to automatically generate and refine visual prompts online, thereby enabling the tracker to suppress distractors through contrastive guidance. To transfer contrastive knowledge from the foundation model to the tracker, PiVOT automatically propagates this knowledge online and dynamically generates and updates visual prompts. Specifically, it proposes a prompt initialization mechanism that…
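The abstract's idea of suppressing distractors through contrastive guidance can be illustrated loosely. The sketch below is not PiVOT's implementation: it assumes each candidate region and the target template have already been embedded (for example by a CLIP image encoder), and the function names, the toy three-dimensional vectors, and the temperature value are all hypothetical. It turns cosine similarities into softmax weights so that target-like candidates are emphasized and dissimilar ones are down-weighted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prompt_weights(target_emb, candidate_embs, temperature=0.1):
    """Softmax over temperature-scaled similarities to the target embedding.

    Hypothetical helper: a high weight means the candidate resembles the
    target, a low weight marks it as a likely distractor.
    """
    scaled = [cosine(target_emb, c) / temperature for c in candidate_embs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for image embeddings (assumed, not real CLIP features).
target = [1.0, 0.0, 0.0]
candidates = [
    [0.9, 0.1, 0.0],  # resembles the target
    [0.0, 1.0, 0.0],  # distractor
    [0.1, 0.0, 1.0],  # distractor
]

weights = prompt_weights(target, candidates)
best = max(range(len(weights)), key=weights.__getitem__)
print(best, [round(w, 4) for w in weights])
```

A lower temperature sharpens the weighting toward the single most target-like candidate, while a higher one spreads weight more evenly across candidates.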
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · Dense Connections · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training
