Visual Prompt Multi-Modal Tracking

Jiawen Zhu; Simiao Lai; Xin Chen; Dong Wang; Huchuan Lu

arXiv:2303.10826 · cs.CV · March 28, 2023 · 1 citation

TL;DR

This paper introduces ViPT, a visual prompt learning approach that adapts pre-trained RGB models to multi-modal tracking tasks with minimal trainable parameters, outperforming full fine-tuning methods.

Contribution

Develops a visual prompt learning method for multi-modal tracking that trains far fewer parameters than full fine-tuning and achieves state-of-the-art results across several modalities.

Findings

01

ViPT outperforms full fine-tuning on multiple multi-modal tracking tasks.

02

It trains fewer than 1% of the model's parameters.

03

It achieves state-of-the-art performance while remaining parameter-efficient.

Abstract

Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspired by the recent success of the prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning…
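
The "less than 1% of model parameters" figure in the abstract can be made concrete with a rough parameter count. The sketch below is a back-of-the-envelope illustration, not the paper's architecture: the encoder dimensions (`EncoderConfig`), the shape of the per-layer prompt block (`prompt_params`), and the `bottleneck` width are all hypothetical assumptions chosen only to show how a frozen backbone plus small trainable prompt modules can land well under that budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical ViT-style encoder dimensions (assumed, not from the paper)."""
    depth: int = 12
    dim: int = 768
    mlp_ratio: int = 4


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def encoder_params(cfg: EncoderConfig) -> int:
    """Parameters in the transformer blocks of the frozen backbone."""
    hidden = cfg.dim * cfg.mlp_ratio
    attention = linear_params(cfg.dim, 3 * cfg.dim) + linear_params(cfg.dim, cfg.dim)
    mlp = linear_params(cfg.dim, hidden) + linear_params(hidden, cfg.dim)
    layer_norms = 2 * (2 * cfg.dim)  # scale and shift for two norms per block
    return cfg.depth * (attention + mlp + layer_norms)


def prompt_params(cfg: EncoderConfig, bottleneck: int) -> int:
    """Assumed prompt block per layer: one down-projection for each of the
    two modalities, followed by a shared up-projection back to `dim`."""
    down = 2 * linear_params(cfg.dim, bottleneck)
    up = linear_params(bottleneck, cfg.dim)
    return cfg.depth * (down + up)


def trainable_fraction(cfg: EncoderConfig, bottleneck: int) -> float:
    """Share of all parameters that receive gradients when the backbone is frozen."""
    frozen = encoder_params(cfg)
    trainable = prompt_params(cfg, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    cfg = EncoderConfig()
    print(f"frozen backbone parameters: {encoder_params(cfg):,}")
    print(f"trainable prompt parameters: {prompt_params(cfg, bottleneck=8):,}")
    print(f"trainable fraction: {trainable_fraction(cfg, bottleneck=8):.4%}")
```

Under these assumed sizes the trainable share stays below 1%, and widening `bottleneck` shows how quickly the budget grows. The real figure depends on the actual backbone and prompt design in the official repository.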

Code & Models

Repositories

jiawen-zhu/vipt
PyTorch · Official

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gaze Tracking and Assistive Technology · Advanced Computing and Algorithms