Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

Yunfeng Li; Bo Wang; Ye Li

arXiv:2405.03177 · cs.CV · June 24, 2025

TL;DR

This paper introduces CSTNet, a transformer-based RGB-T tracking model that employs novel channel and spatial feature fusion modules, achieving state-of-the-art accuracy and real-time performance on embedded devices.

Contribution

It proposes a fusion approach that places a Joint Spatial and Channel Fusion Module (JSCFM) and a Spatial Fusion Module (SFM) between the blocks of a transformer backbone, enhancing cross-modal feature interaction for RGB-T tracking.

Findings

01. CSTNet achieves state-of-the-art tracking accuracy.

02. CSTNet-small runs at 33 fps with minimal performance loss.

03. The model is suitable for real-time deployment on embedded systems.

Abstract

The main problem in RGB-T tracking is the correct and optimal merging of the cross-modal features of visible and thermal images. Some previous methods either do not fully exploit the potential of RGB and TIR information for channel and spatial feature fusion or lack a direct interaction between the template and the search area, which limits the model's ability to fully utilize the original semantic information of both modalities. To address these limitations, we investigate how to achieve a direct fusion of cross-modal channels and spatial features in RGB-T tracking and propose CSTNet. It uses the Vision Transformer (ViT) as the backbone and adds a Joint Spatial and Channel Fusion Module (JSCFM) and Spatial Fusion Module (SFM) integrated between the transformer blocks to facilitate cross-modal feature interaction. The JSCFM module achieves joint modeling of channel and multi-level…
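
As a rough illustration of what fusing RGB and TIR features along both the channel and the spatial axis can look like, here is a minimal NumPy sketch. It is not the paper's JSCFM or SFM: the function name `channel_spatial_fuse`, the parameter-free sigmoid gates, and the `(tokens, channels)` layout are all assumptions made for this example. The actual modules are learned and are defined in the paper and its repository.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_spatial_fuse(rgb, tir):
    """Blend two (tokens, channels) feature maps with channel and spatial gates.

    Hypothetical illustration only; not the JSCFM or SFM from the paper.
    """
    if rgb.shape != tir.shape:
        raise ValueError("rgb and tir must have the same shape")
    summed = rgb + tir
    # Channel gate: one weight per channel, from the mean over all tokens.
    channel_gate = sigmoid(summed.mean(axis=0, keepdims=True))  # shape (1, C)
    # Spatial gate: one weight per token, from the mean over all channels.
    spatial_gate = sigmoid(summed.mean(axis=1, keepdims=True))  # shape (N, 1)
    # Broadcasting yields a per-element gate in (0, 1) of shape (N, C).
    gate = channel_gate * spatial_gate
    # Convex combination: a gate near 1 favours RGB, near 0 favours TIR.
    return gate * rgb + (1.0 - gate) * tir


# Example with 16 tokens of 8 channels each (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 8))
tir = rng.standard_normal((16, 8))
fused = channel_spatial_fuse(rgb, tir)
```

Because the gate lies strictly between 0 and 1, every fused value falls between the corresponding RGB and TIR values, and identical inputs pass through unchanged. A learned module would replace the fixed averages with trainable projections.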

Code & Models

Repositories

liyunfenglyf/cstnet (PyTorch, official)

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Advanced Optical Sensing Technologies · Advanced Vision and Imaging

Methods: Attention Is All You Need · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Adam