MTNet: Learning modality-aware representation with transformer for RGBT tracking

Ruichao Hou; Boyue Xu; Tongwei Ren; Gangshan Wu

arXiv:2508.17280·cs.CV·August 26, 2025

TL;DR

This paper introduces MTNet, a transformer-based RGBT tracker that learns modality-specific cues and global dependencies, improving tracking accuracy and robustness while running in real time.

Contribution

The paper proposes a modality-aware transformer network for RGBT tracking that combines two modality-specific modules (CADM and SSPM), a transformer fusion network, a trident prediction head, and a dynamic template update strategy.

Findings

01 Achieves state-of-the-art results on three RGBT benchmarks.

02 Operates in real time with improved accuracy.

03 Handles challenges such as scale variation and deformation.

Abstract

The ability to learn robust multi-modality representations has played a critical role in the development of RGBT tracking. However, the regular fusion paradigm and the invariable tracking template still restrict feature interaction. In this paper, we propose a modality-aware tracker based on the transformer, termed MTNet. Specifically, a modality-aware network is presented to explore modality-specific cues; it contains both a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM). A transformer fusion network is then applied to capture global dependencies and reinforce instance representations. To estimate the precise location and tackle challenges such as scale variation and deformation, we design a trident prediction head and a dynamic update strategy which jointly maintain a reliable template for facilitating inter-frame…
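The abstract does not spell out how the dynamic update strategy decides when to refresh the template, so the following is only a minimal sketch of one common pattern: a confidence-gated update checked at a fixed frame interval. The class name `TemplateUpdater`, the `score_threshold` and `update_interval` parameters, their default values, and the assumption that the prediction head yields a per-frame confidence score are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class TemplateUpdater:
    """Hypothetical confidence-gated template update (not the paper's exact rule)."""

    initial_template: Any
    score_threshold: float = 0.5   # assumed value, not from the paper
    update_interval: int = 10      # assumed value, in frames
    dynamic_template: Optional[Any] = None
    frame_index: int = 0

    def step(self, candidate: Any, score: float) -> bool:
        """Advance one frame; return True if the dynamic template was replaced."""
        self.frame_index += 1
        # Only consider an update every `update_interval` frames.
        due = self.frame_index % self.update_interval == 0
        # Replace the dynamic template only when the prediction looks reliable.
        if due and score >= self.score_threshold:
            self.dynamic_template = candidate
            return True
        return False

    def templates(self) -> List[Any]:
        """The initial template is always kept; the dynamic one is added once set."""
        if self.dynamic_template is None:
            return [self.initial_template]
        return [self.initial_template, self.dynamic_template]
```

In this sketch the initial template is never overwritten, so a bad update can degrade only the dynamic slot. Whether MTNet keeps both templates or applies a different gating rule is not stated in the text above.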
