Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Jianqiang Xia; DianXi Shi; Ke Song; Linna Song; XiaoLei Wang; Songchang Jin; Li Zhou; Yu Cheng; Lei Jin; Zheng Zhu; Jianan Li; Gang Wang; Junliang Xing; Jian Zhao

arXiv:2308.13764 · cs.CV · August 29, 2023

TL;DR

The paper introduces USTrack, a unified single-stage Transformer network for RGB-T tracking that enhances feature interaction, improves accuracy, and achieves real-time speed by integrating multiple stages into a single ViT backbone.

Contribution

The paper proposes a novel unified single-stage Transformer architecture with a dual embedding layer and a modality reliability mechanism for efficient and accurate RGB-T tracking.

Findings

1. Achieves state-of-the-art performance on three benchmarks.
2. Maintains a real-time inference speed of 84.2 FPS.
3. Significantly improves the MPR/MSR metrics on the VTUAV dataset.

Abstract

Most existing RGB-T tracking networks extract modality features in a separate manner, which lacks interaction and mutual guidance between modalities. This limits the network's ability to adapt to the diverse dual-modality appearances of targets and the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone with a dual embedding layer through self-attention mechanism. With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities. Simultaneously, relation modeling is performed between these features,…
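The single-backbone idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the image sizes, patch size, embedding width, the single-channel thermal input, the random weights, and the helper names (`patchify`, `joint_self_attention`) are all assumptions made for illustration, and position embeddings, multiple heads, and the prediction head are omitted. The sketch only shows how two modality-specific embedding layers can feed one token sequence, so that a single self-attention pass mixes template and search tokens from both modalities.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    # Bring the two grid axes together, then flatten each patch.
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over the whole token sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
patch, dim = 16, 32  # hypothetical patch size and embedding width

# Hypothetical inputs: RGB has 3 channels, thermal is assumed to have 1.
template_rgb = rng.standard_normal((64, 64, 3))
template_tir = rng.standard_normal((64, 64, 1))
search_rgb = rng.standard_normal((128, 128, 3))
search_tir = rng.standard_normal((128, 128, 1))

# "Dual embedding": one linear patch projection per modality (random here).
embed_rgb = rng.standard_normal((patch * patch * 3, dim)) * 0.02
embed_tir = rng.standard_normal((patch * patch * 1, dim)) * 0.02

# One sequence holding template and search tokens of both modalities.
tokens = np.concatenate([
    patchify(template_rgb, patch) @ embed_rgb,
    patchify(template_tir, patch) @ embed_tir,
    patchify(search_rgb, patch) @ embed_rgb,
    patchify(search_tir, patch) @ embed_tir,
], axis=0)

wq, wk, wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out, attn = joint_self_attention(tokens, wq, wk, wv)
```

Because every token attends to every other token, each attention row spans both modalities and both image regions, which is the sense in which feature extraction, fusion, and relation modeling can share one pass in this sketch.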

Code & Models

Repositories

xiajianqiang/USTrack (Official)

Taxonomy

Topics: Video Surveillance and Tracking Methods · Infrared Target Detection Methodologies · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Feature Selection · Byte Pair Encoding · Adam