Improving Visual Object Tracking through Visual Prompting

Shih-Fang Chen; Jun-Cheng Chen; I-Hong Jhuo; Yen-Yu Lin

arXiv:2409.18901 · cs.CV · March 10, 2026

TL;DR

This paper introduces PiVOT, a novel visual prompting mechanism leveraging the CLIP foundation model to dynamically generate and refine prompts, significantly enhancing the discriminative ability of generic object trackers against distractors.

Contribution

The paper proposes a new visual prompting approach for object tracking that uses CLIP to automatically generate and refine prompts online, improving distractor suppression and tracking accuracy.

Findings

01 PiVOT improves tracking performance across multiple benchmarks.

02 The method effectively suppresses distractors during tracking.

03 Extensive experiments validate the effectiveness of the proposed prompting mechanism.

Abstract

Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Dynamic adaptation of target representation against distractors remains challenging because prevailing trackers exhibit limited discriminative capability. To address this issue, we present a new visual prompting mechanism for generic object tracking, termed PiVOT. PiVOT introduces mechanisms that leverage the pretrained foundation model (CLIP) to automatically generate and refine visual prompts online, thereby enabling the tracker to suppress distractors through contrastive guidance. To transfer contrastive knowledge from the foundation model to the tracker, PiVOT automatically propagates this knowledge online and dynamically generates and updates visual prompts. Specifically, it proposes a prompt initialization mechanism that…
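The idea of using feature similarity to separate the target from distractors can be illustrated with a minimal sketch. This is not PiVOT's implementation: the function names, the temperature value, and the toy three-dimensional vectors are assumptions made for illustration, and the vectors merely stand in for embeddings that an image encoder such as CLIP's would produce. The sketch scores each candidate region by cosine similarity to the target embedding and turns the scores into weights with a softmax, so that dissimilar candidates (likely distractors) receive weights near zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_weights(target_feat, candidate_feats, temperature=0.1):
    """Softmax over candidate-to-target cosine similarities.

    Hypothetical helper: a lower temperature sharpens the contrast
    between the best-matching candidate and the rest.
    """
    sims = [cosine(target_feat, c) for c in candidate_feats]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features (made-up values) standing in for encoder embeddings.
target = [1.0, 0.0, 0.0]
candidates = [
    [0.9, 0.1, 0.0],  # visually close to the target
    [0.0, 1.0, 0.0],  # distractor
    [0.1, 0.0, 1.0],  # distractor
]

weights = prompt_weights(target, candidates)
best = max(range(len(weights)), key=weights.__getitem__)
print(best, [round(w, 4) for w in weights])
```

In a real tracker the weights would be computed over spatial candidate regions in each frame and refreshed as the target appearance changes; how PiVOT does this is described in the paper itself, not in this sketch.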

Code & Models

Repositories

chenshihfang/GOT
PyTorch · Official

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · Dense Connections · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training