Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion
Yunfeng Li, Bo Wang, Ye Li

TL;DR
This paper introduces CSTNet, a transformer-based RGB-T tracking model that employs novel channel and spatial feature fusion modules, achieving state-of-the-art accuracy and real-time performance on embedded devices.
Contribution
It proposes a fusion approach that places a Joint Spatial and Channel Fusion Module (JSCFM) and a Spatial Fusion Module (SFM) within a transformer framework to enhance cross-modal feature interaction for RGB-T tracking.
Findings
CSTNet achieves state-of-the-art tracking accuracy.
CSTNet-small runs at 33 fps with minimal performance loss.
The model is suitable for real-time deployment on embedded systems.
Abstract
The central challenge in RGB-T tracking is merging the cross-modal features of visible (RGB) and thermal infrared (TIR) images effectively. Some previous methods either do not fully exploit RGB and TIR information for channel and spatial feature fusion or lack direct interaction between the template and the search area, which limits the model's ability to use the original semantic information of both modalities. To address these limitations, we investigate how to directly fuse cross-modal channel and spatial features in RGB-T tracking and propose CSTNet. It uses the Vision Transformer (ViT) as the backbone and inserts a Joint Spatial and Channel Fusion Module (JSCFM) and a Spatial Fusion Module (SFM) between the transformer blocks to facilitate cross-modal feature interaction. The JSCFM achieves joint modeling of channel and multi-level…
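The layout the abstract describes, with fusion modules sitting between transformer blocks of a two-stream backbone, can be illustrated with a minimal pure-Python sketch. This is not the authors' JSCFM or SFM, whose internals are not given in the text above. The names `channel_gate`, `spatial_gate`, `fuse`, `block`, and `run_backbone` are hypothetical, the gating is a generic sigmoid re-weighting chosen for illustration, and `block` is an identity stand-in for a real transformer block. Each token sequence is assumed to hold the concatenated template and search-area tokens of one modality.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(tokens):
    """One weight per channel, from the mean of that channel over all tokens."""
    n, c = len(tokens), len(tokens[0])
    return [sigmoid(sum(t[j] for t in tokens) / n) for j in range(c)]


def spatial_gate(tokens):
    """One weight per token (spatial position), from the mean over its channels."""
    return [sigmoid(sum(t) / len(t)) for t in tokens]


def fuse(rgb, tir):
    """Hypothetical channel + spatial fusion (an assumption, not the paper's module).

    Each modality receives the other modality's tokens as a residual,
    scaled by channel and spatial gates computed from that other modality.
    """
    ch_tir, sp_tir = channel_gate(tir), spatial_gate(tir)
    ch_rgb, sp_rgb = channel_gate(rgb), spatial_gate(rgb)
    new_rgb = [
        [r + sp_tir[i] * ch_tir[j] * t for j, (r, t) in enumerate(zip(rgb[i], tir[i]))]
        for i in range(len(rgb))
    ]
    new_tir = [
        [t + sp_rgb[i] * ch_rgb[j] * r for j, (r, t) in enumerate(zip(rgb[i], tir[i]))]
        for i in range(len(rgb))
    ]
    return new_rgb, new_tir


def block(tokens):
    """Identity stand-in for a transformer block (attention + MLP in a real model)."""
    return [list(t) for t in tokens]


def run_backbone(rgb, tir, depth=4, fuse_after=(1, 3)):
    """Run both streams through `depth` blocks, fusing after the listed block indices."""
    for i in range(depth):
        rgb, tir = block(rgb), block(tir)
        if i in fuse_after:
            rgb, tir = fuse(rgb, tir)
    return rgb, tir


if __name__ == "__main__":
    # Three tokens with two channels each, per modality (toy values).
    rgb_tokens = [[1.0, -2.0], [0.5, 3.0], [0.0, 1.0]]
    tir_tokens = [[0.2, 0.1], [-1.0, 0.4], [2.0, -0.5]]
    fused_rgb, fused_tir = run_backbone(rgb_tokens, tir_tokens)
    print(len(fused_rgb), len(fused_rgb[0]))
```

The token count and channel width are unchanged by `fuse`, which is what allows a module of this kind to be dropped between existing blocks without altering the rest of the backbone. Which blocks are followed by fusion (`fuse_after`) is a free choice in this sketch.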
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Optical Sensing Technologies · Advanced Vision and Imaging
Methods: Attention Is All You Need · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Adam
