Unified Single-Stage Transformer Network for Efficient RGB-T Tracking
Jianqiang Xia, DianXi Shi, Ke Song, Linna Song, XiaoLei Wang, Songchang Jin, Li Zhou, Yu Cheng, Lei Jin, Zheng Zhu, Jianan Li, Gang Wang, Junliang Xing, Jian Zhao

TL;DR
The paper introduces USTrack, a unified single-stage Transformer network for RGB-T tracking that enhances feature interaction, improves accuracy, and achieves real-time speed by integrating multiple stages into a single ViT backbone.
Contribution
It proposes a novel unified single-stage Transformer architecture with a dual embedding layer and modality reliability mechanism for efficient and accurate RGB-T tracking.
Findings
Achieves state-of-the-art performance on three benchmarks.
Maintains real-time inference speed of 84.2 FPS.
Significantly improves the MPR/MSR metrics on the VTUAV dataset.
Abstract
Most existing RGB-T tracking networks extract modality features separately, so the modalities neither interact with nor guide each other. This limits the network's ability to adapt to the diverse dual-modality appearances of targets and the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone with a dual embedding layer through the self-attention mechanism. With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities. Simultaneously, relation modeling is performed between these features,…
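The core idea in the abstract, embedding each modality separately and then letting one self-attention operation mix template and search tokens from both modalities, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, image sizes, patch size, embedding width, and the single-channel thermal input are all assumptions made for illustration, and positional embeddings, multi-head attention, stacked ViT blocks, the modality reliability mechanism, and the prediction head are omitted.

```python
import numpy as np


def patch_embed(image, patch, weight):
    """Split an (H, W, C) image into non-overlapping patches and project each to a token."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return patches @ weight  # (num_patches, dim)


def joint_tokens(rgb_z, tir_z, rgb_x, tir_x, w_rgb, w_tir, patch):
    """Hypothetical dual embedding: one projection per modality, then one token sequence."""
    parts = [
        patch_embed(rgb_z, patch, w_rgb),  # RGB template
        patch_embed(tir_z, patch, w_tir),  # thermal template
        patch_embed(rgb_x, patch, w_rgb),  # RGB search region
        patch_embed(tir_x, patch, w_tir),  # thermal search region
    ]
    return np.concatenate(parts, axis=0)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the whole sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn


# Assumed sizes: 64x64 template, 128x128 search region, 16x16 patches, width 32.
rng = np.random.default_rng(0)
dim, patch = 32, 16
w_rgb = rng.normal(scale=0.02, size=(patch * patch * 3, dim))  # 3-channel RGB
w_tir = rng.normal(scale=0.02, size=(patch * patch * 1, dim))  # 1-channel thermal (assumption)

rgb_z, tir_z = rng.random((64, 64, 3)), rng.random((64, 64, 1))
rgb_x, tir_x = rng.random((128, 128, 3)), rng.random((128, 128, 1))

tokens = joint_tokens(rgb_z, tir_z, rgb_x, tir_x, w_rgb, w_tir, patch)
wq, wk, wv = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
fused, attn = self_attention(tokens, wq, wk, wv)

print(tokens.shape, fused.shape, attn.shape)
```

Because every token attends to every other token, a single attention call covers cross-modality interaction (RGB with thermal) and template-to-search relation modeling at once, which is the sense in which separate stages can collapse into one backbone.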
Taxonomy
Topics: Video Surveillance and Tracking Methods · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Feature Selection · Byte Pair Encoding · Adam
