Lightweight RGB-T Tracking with Mobile Vision Transformers
Mahdi Falaki, Maria A. Amer

TL;DR
This paper introduces a lightweight, real-time RGB-T tracking model based on MobileViT that effectively combines multimodal cues for improved accuracy under challenging conditions, suitable for embedded and mobile devices.
Contribution
The paper presents what the authors describe as the first MobileViT-based multimodal tracker. It uses a progressive fusion framework and runs in real time (25.7 FPS on CPU, 122 FPS on GPU) with under 4 million parameters.
Findings
Achieves 25.7 FPS on CPU and 122 FPS on GPU
Uses under 4 million parameters for efficiency
Demonstrates improved robustness in challenging conditions
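To put the reported figures in more familiar units, the short sketch below converts frames per second to per-frame latency and turns the parameter ceiling into an upper bound on weight storage. The helper names are hypothetical, and the 4-bytes-per-parameter figure assumes 32-bit floating-point weights, which the summary does not state.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time budget per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def weight_memory_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage needed for the weights alone, in megabytes (1 MB = 1e6 bytes).

    bytes_per_param=4 assumes float32 weights (an assumption, not from the paper).
    """
    return num_params * bytes_per_param / 1e6


cpu_ms = frame_latency_ms(25.7)          # reported CPU speed
gpu_ms = frame_latency_ms(122.0)         # reported GPU speed
upper_mb = weight_memory_mb(4_000_000)   # "under 4M parameters" is an upper bound

print(f"CPU: {cpu_ms:.1f} ms/frame, GPU: {gpu_ms:.1f} ms/frame")
print(f"Weights: under {upper_mb:.0f} MB at float32")
```

Under these assumptions, the tracker spends roughly 39 ms per frame on CPU and about 8 ms on GPU, and its weights occupy less than 16 MB.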
Abstract
Single-modality (RGB-only) tracking struggles under low illumination, adverse weather, and occlusion. Multimodal tracking addresses this by combining complementary cues. While Vision Transformer-based trackers achieve strong accuracy, they are often too large for real-time use. We propose a lightweight RGB-T tracker built on MobileViT with a progressive fusion framework that models intra- and inter-modal interactions using separable mixed attention. This design delivers compact, effective features for accurate localization, with under 4M parameters and real-time performance of 25.7 FPS on the CPU and 122 FPS on the GPU, supporting embedded and mobile platforms. To the best of our knowledge, this is the first MobileViT-based multimodal tracker. Model code and weights are available in the GitHub repository.
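The abstract does not spell out how separable mixed attention is computed. As a rough illustration only, the NumPy sketch below applies linear-complexity separable self-attention in the style of MobileViTv2 to the concatenation of RGB and thermal tokens, so that a single pass covers both intra- and inter-modal interactions. The concatenation scheme, the function names, the token counts, and the random weights are all assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np


def separable_attention(x, w_i, w_k, w_v, w_o):
    """Separable self-attention (MobileViTv2 style) over tokens x of shape (k, d).

    Cost is linear in the number of tokens k, because every token interacts
    with one shared context vector instead of with every other token.
    """
    scores = x @ w_i                                  # (k, 1) one score per token
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over tokens
    context = (weights * (x @ w_k)).sum(axis=0, keepdims=True)  # (1, d)
    values = np.maximum(x @ w_v, 0.0)                 # (k, d) ReLU-activated values
    return (values * context) @ w_o                   # (k, d)


def mixed_rgbt_attention(rgb_tokens, tir_tokens, params):
    """Hypothetical 'mixed' variant: attend over RGB and thermal tokens jointly."""
    tokens = np.concatenate([rgb_tokens, tir_tokens], axis=0)
    fused = separable_attention(tokens, *params)
    n_rgb = rgb_tokens.shape[0]
    return fused[:n_rgb], fused[n_rgb:]


rng = np.random.default_rng(0)
d = 16                                                # illustrative channel width
params = (
    rng.normal(size=(d, 1)),                          # w_i: token scoring
    rng.normal(size=(d, d)),                          # w_k: key projection
    rng.normal(size=(d, d)),                          # w_v: value projection
    rng.normal(size=(d, d)),                          # w_o: output projection
)
rgb = rng.normal(size=(64, d))                        # 64 RGB tokens
tir = rng.normal(size=(64, d))                        # 64 thermal tokens
rgb_out, tir_out = mixed_rgbt_attention(rgb, tir, params)
print(rgb_out.shape, tir_out.shape)
```

Because the context vector is pooled over tokens from both modalities, changing the thermal input changes the RGB output, which is the inter-modal coupling this sketch is meant to show.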
Taxonomy
Topics: Video Surveillance and Tracking Methods · CCD and CMOS Imaging Sensors
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
