RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba
Andong Lu, Wanyu Wang, Chenglong Li, Jin Tang, and Bin Luo

TL;DR
This paper introduces AINet, an RGBT tracking network that performs efficient all-layer multimodal interactions through a progressive fusion Mamba and reports state-of-the-art performance on four RGBT tracking datasets.
Contribution
The paper proposes a new All-layer multimodal Interaction Network with a Difference-based Fusion Mamba and Order-dynamic Fusion Mamba for efficient, comprehensive feature interaction across all layers.
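As a rough illustration of why a Mamba-style scan keeps all-layer interaction affordable, the sketch below runs a plain linear state-space recurrence over tokens gathered from several layers. It is a toy under stated assumptions, not the authors' Difference-based or Order-dynamic Fusion Mamba: the function names, the scalar parameters `a`, `b`, `c`, and the per-layer feature values are all hypothetical, and the recurrence is non-selective (fixed parameters), unlike Mamba itself.

```python
from typing import List


def difference_tokens(rgb: List[float], tir: List[float]) -> List[float]:
    """Element-wise RGB-minus-thermal discrepancy (hypothetical helper)."""
    return [r - t for r, t in zip(rgb, tir)]


def linear_scan(tokens: List[float], a: float = 0.9,
                b: float = 1.0, c: float = 1.0) -> List[float]:
    """Plain state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence, so the cost grows linearly with the
    number of tokens. The parameters are fixed scalars here (assumption);
    Mamba makes them input-dependent.
    """
    h = 0.0
    outputs = []
    for x in tokens:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# Hypothetical per-layer features for each modality (made-up values).
rgb_layers = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
tir_layers = [[0.5, 2.0], [0.5, 0.0], [1.0, 1.0]]

# Flatten the discrepancy tokens of every layer into one sequence,
# so a single scan lets later layers see a summary of earlier ones.
all_layer_tokens = [
    d
    for rgb, tir in zip(rgb_layers, tir_layers)
    for d in difference_tokens(rgb, tir)
]
fused = linear_scan(all_layer_tokens, a=0.5)
```

The point of the sketch is only the cost profile: the scan touches each token once, so adding more layers lengthens the sequence without the quadratic growth of pairwise attention.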
Findings
Achieves state-of-the-art performance on four RGBT tracking datasets.
Effectively balances interaction capability and computational efficiency.
Demonstrates robustness and accuracy improvements over existing methods.
Abstract
Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion at each layer but, because of the large computational burden, cannot execute feature interactions among all layers, which play a critical role in robust multimodal representation. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions across all modalities and layers in a progressive fusion Mamba for robust RGBT tracking. Although modality features in different layers are known to contain different cues, building multimodal interactions in each layer remains challenging because it is difficult to balance interaction capability and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Face recognition and analysis
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
