Explicit Visual Prompts for Visual Object Tracking
Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang,, Xianxian Li

TL;DR
EVPTrack introduces an explicit visual prompts framework utilizing spatio-temporal tokens and multi-scale information to improve visual object tracking by avoiding template updates and enhancing efficiency.
Contribution
The paper proposes a novel explicit visual prompts framework for tracking that leverages spatio-temporal tokens and multi-scale features, eliminating the need for template updating strategies.
Findings
Achieves competitive performance on six benchmarks.
Operates at real-time speed with effective exploitation of spatio-temporal info.
Improves handling of target scale changes through multi-scale prompts.
Abstract
How to effectively exploit spatio-temporal information is crucial to capture target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the \textit{when-and-how-to-update} dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed \textbf{EVPTrack}. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we cannot only alleviate the challenge of \textit{when-to-update}, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsVideo Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Face recognition and analysis
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Focus
