Mobile Vision Transformer-based Visual Object Tracking

Goutam Yelluru Gopal; Maria A. Amer

arXiv:2309.05829·cs.CV·September 13, 2023·1 cites

TL;DR

This paper introduces MVT, a lightweight and fast visual object tracker built on a Mobile Vision Transformer (MobileViT) backbone. It outperforms recent lightweight trackers on GOT10k and TrackingNet, and surpasses the larger DiMP-50 tracker in accuracy while using fewer parameters and running faster.

Contribution

First use of a MobileViT backbone for object tracking, combined with a novel approach that fuses the template and search-region representations inside the backbone to produce a better feature encoding for target localization.

Findings

01 Outperforms recent lightweight trackers on the GOT10k and TrackingNet datasets.

02 Surpasses DiMP-50 in accuracy while using fewer parameters and running inference faster.

03 Balances speed and accuracy at a level suitable for real-time applications.

Abstract

The introduction of robust backbones, such as Vision Transformers, has improved the performance of object tracking algorithms in recent years. However, these state-of-the-art trackers are computationally expensive since they have a large number of model parameters and rely on specialized hardware (e.g., GPU) for faster inference. On the other hand, recent lightweight trackers are fast but are less accurate, especially on large-scale datasets. We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time. We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization. The experimental results show that our MobileViT-based Tracker, MVT, surpasses the performance of recent lightweight…
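The abstract does not describe how the template and search-region representations are fused, so the following is only a rough illustration of one generic way to mix the two: joint self-attention over the concatenated token sequences. This technique is swapped in as an assumption and should not be read as the MVT design. The function name `joint_self_attention`, the token counts, the feature dimension, and the random projection weights are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(template_tokens, search_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated template and search tokens.

    Hypothetical helper for illustration; not taken from the MVT code.
    template_tokens: (n_t, dim), search_tokens: (n_s, dim).
    """
    n_t = template_tokens.shape[0]
    # Concatenate so every token can attend to tokens from both regions.
    tokens = np.concatenate([template_tokens, search_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    fused = tokens + attn @ v  # residual connection
    # Split back into the template part and the search part.
    return fused[:n_t], fused[n_t:], attn


rng = np.random.default_rng(0)
dim = 16  # assumed feature dimension
template = rng.standard_normal((4, dim))  # assumed 4 template tokens
search = rng.standard_normal((9, dim))    # assumed 9 search-region tokens
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused_template, fused_search, attn = joint_self_attention(
    template, search, w_q, w_k, w_v
)
print(fused_template.shape, fused_search.shape, attn.shape)
```

Because the attention matrix spans both token sets, each search-region token's output depends on the template tokens and vice versa, which is the general sense in which the two representations are "fused" in this sketch.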

Code & Models

Repositories

goutamyg/mvt
PyTorch · Official

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: MobileViT · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings