Mobile Vision Transformer-based Visual Object Tracking
Goutam Yelluru Gopal, Maria A. Amer

TL;DR
This paper introduces a lightweight, fast visual object tracker built on a Mobile Vision Transformer (MobileViT) backbone. It outperforms recent lightweight trackers and surpasses the larger DiMP-50 in accuracy while running faster.
Contribution
First use of a MobileViT backbone for object tracking, together with a novel approach to fusing the template and search region representations inside the backbone for better feature encoding.
Findings
Outperforms recent lightweight trackers on the GOT-10k and TrackingNet datasets.
Surpasses DiMP-50 in accuracy while using fewer parameters and running faster at inference.
Balances speed and accuracy well enough for real-time applications.
Abstract
The introduction of robust backbones, such as Vision Transformers, has improved the performance of object tracking algorithms in recent years. However, these state-of-the-art trackers are computationally expensive since they have a large number of model parameters and rely on specialized hardware (e.g., GPU) for faster inference. On the other hand, recent lightweight trackers are fast but are less accurate, especially on large-scale datasets. We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time. We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization. The experimental results show that our MobileViT-based Tracker, MVT, surpasses the performance of recent lightweight…
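The abstract does not spell out how the template and search region representations are fused, so the following is only a minimal sketch of one common way to do it: concatenate the two token sequences and run a single self-attention pass over the joint sequence, so every search token can attend to every template token. The function name `joint_self_attention`, the use of the tokens themselves as queries, keys, and values (no learned projections), and the toy inputs are all assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(template_tokens, search_tokens):
    """Single-head self-attention over concatenated template + search tokens.

    Hypothetical helper: queries, keys, and values are the raw tokens
    (no learned projections), which keeps the sketch dependency-free.
    Returns the fused template tokens and the fused search tokens.
    """
    tokens = template_tokens + search_tokens  # joint sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Attention output: weighted average of all tokens.
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    n_t = len(template_tokens)
    return fused[:n_t], fused[n_t:]


# Toy inputs (made up): 2 template tokens and 3 search tokens of dimension 2.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
fused_template, fused_search = joint_self_attention(template, search)
```

Because attention runs over the joint sequence, changing the template changes the fused search features, which is the property a fusion step needs for target localization. A real backbone would add learned projections, multiple heads, and the convolutional stages of MobileViT around this step.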
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: MobileViT · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
