Lightweight RGB-T Tracking with Mobile Vision Transformers

Mahdi Falaki; Maria A. Amer

arXiv:2506.19154 · cs.CV · February 4, 2026

TL;DR

This paper introduces a lightweight, real-time RGB-T tracking model based on MobileViT that effectively combines multimodal cues for improved accuracy under challenging conditions, suitable for embedded and mobile devices.

Contribution

The paper presents the first MobileViT-based multimodal tracker with a novel progressive fusion framework, achieving high accuracy with under 4 million parameters and real-time speeds.

Findings

01. Achieves 25.7 FPS on CPU and 122 FPS on GPU
02. Uses under 4 million parameters for efficiency
03. Demonstrates improved robustness in challenging conditions

Abstract

Single-modality (RGB-only) tracking struggles under low illumination, adverse weather, and occlusion. Multimodal tracking addresses this by combining complementary cues. While Vision Transformer-based trackers achieve strong accuracy, they are often too large for real-time use. We propose a lightweight RGB-T tracker built on MobileViT with a progressive fusion framework that models intra- and inter-modal interactions using separable mixed attention. This design delivers compact, effective features for accurate localization, with under 4M parameters and real-time performance of 25.7 FPS on the CPU and 122 FPS on the GPU, supporting embedded and mobile platforms. To the best of our knowledge, this is the first MobileViT-based multimodal tracker. Model code and weights are available in the GitHub repository.
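The abstract does not spell out how the separable mixed attention is formulated, so the following is only an illustrative sketch, not the authors' implementation. It assumes the separable self-attention used in MobileViTv2 (one scalar score per token, a single pooled context vector, linear cost in the number of tokens) and applies it to the concatenation of RGB and thermal tokens so that each modality's output depends on both. The function names, weight shapes, and token counts are hypothetical.

```python
import numpy as np

def separable_attention(x, w_i, w_k, w_v, w_o):
    """Separable self-attention in the style of MobileViTv2 (assumed here).

    x: (n_tokens, d) tokens; w_i: (d, 1); w_k, w_v, w_o: (d, d).
    Cost grows linearly with n_tokens because no n x n attention map is built.
    """
    scores = x @ w_i                                    # (n, 1): one scalar per token
    scores = scores - scores.max(axis=0, keepdims=True) # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over tokens
    # Pool all keys into a single context vector shared by every token.
    context = (weights * (x @ w_k)).sum(axis=0, keepdims=True)  # (1, d)
    values = np.maximum(x @ w_v, 0.0)                   # ReLU-activated values
    return (values * context) @ w_o                     # (n, d)

def mixed_attention(rgb_tokens, thermal_tokens, params):
    """Hypothetical 'mixed' variant: attend over the concatenated token set,
    so the shared context vector carries intra- and inter-modal information."""
    n_rgb = rgb_tokens.shape[0]
    mixed = np.concatenate([rgb_tokens, thermal_tokens], axis=0)
    out = separable_attention(mixed, *params)
    return out[:n_rgb], out[n_rgb:]

# Toy example with random weights and tokens (illustrative sizes only).
rng = np.random.default_rng(0)
d = 8
params = [rng.standard_normal((d, 1))] + [rng.standard_normal((d, d)) for _ in range(3)]
rgb = rng.standard_normal((16, d))
thermal = rng.standard_normal((16, d))
rgb_out, thermal_out = mixed_attention(rgb, thermal, params)
print(rgb_out.shape, thermal_out.shape)
```

Because the context vector is pooled over both modalities, changing the thermal tokens changes the RGB outputs as well, which is one simple way to obtain inter-modal interaction without a quadratic attention map.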

Taxonomy

Topics: Video Surveillance and Tracking Methods · CCD and CMOS Imaging Sensors

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings