Visual Prompt Multi-Modal Tracking
Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu

TL;DR
This paper introduces ViPT, a visual prompt learning approach that adapts pre-trained RGB models to multi-modal tracking tasks with minimal trainable parameters, outperforming full fine-tuning methods.
Contribution
Develops a novel visual prompt learning method for multi-modal tracking that requires fewer parameters and achieves state-of-the-art results across various modalities.
Findings
ViPT outperforms full fine-tuning on multiple multi-modal tracking tasks.
Its trainable parameters amount to less than 1% of the model parameters.
It achieves state-of-the-art performance while remaining parameter-efficient.
Abstract
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning…
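The idea of adapting a frozen model with a small set of learned prompts can be illustrated with a toy sketch. This is not the authors' architecture: the layer sizes, the bottleneck prompt module, the additive injection of the prompt into the RGB token, and every name below (`frozen_layers`, `prompt_down`, `prompt_up`, `forward`) are assumptions chosen only to show how the trainable share of parameters can stay under 1% when the backbone is never updated.

```python
import random

random.seed(0)

DIM = 64         # hypothetical token width
BOTTLENECK = 2   # hypothetical width of the prompt bottleneck
NUM_LAYERS = 8   # hypothetical depth of the frozen backbone


def make_matrix(rows, cols):
    """Create a rows x cols weight matrix with small random values."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def count_params(matrix):
    """Number of scalar weights in a matrix."""
    return sum(len(row) for row in matrix)


# Frozen "foundation model": these weights are never updated.
frozen_layers = [make_matrix(DIM, DIM) for _ in range(NUM_LAYERS)]

# Trainable prompt module (assumed design): DIM -> BOTTLENECK -> DIM.
prompt_down = make_matrix(BOTTLENECK, DIM)
prompt_up = make_matrix(DIM, BOTTLENECK)


def forward(rgb_token, aux_token):
    """Build a prompt from the auxiliary modality, add it to the RGB token,
    then run the result through the frozen backbone."""
    prompt = linear(prompt_up, linear(prompt_down, aux_token))
    x = [r + p for r, p in zip(rgb_token, prompt)]
    for layer in frozen_layers:
        x = linear(layer, x)
    return x


frozen_params = sum(count_params(layer) for layer in frozen_layers)
trainable_params = count_params(prompt_down) + count_params(prompt_up)
trainable_fraction = trainable_params / (frozen_params + trainable_params)

rgb = [random.random() for _ in range(DIM)]
aux = [random.random() for _ in range(DIM)]  # e.g. a depth or thermal token
output = forward(rgb, aux)

print(f"frozen: {frozen_params}, trainable: {trainable_params}")
print(f"trainable fraction: {trainable_fraction:.4%}")
```

In a real training loop only `prompt_down` and `prompt_up` would receive gradient updates, so the cost of adapting to a new modality scales with the prompt module rather than with the backbone.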
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gaze Tracking and Assistive Technology · Advanced Computing and Algorithms
