STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification

Xingguo Xu; Zhanyu Liu; Weixiang Zhou; Yuansheng Gao; Junjie Cao; Yuhao Wang; Jixiang Luo; Dell Zhang

arXiv:2603.00695·cs.CV·March 3, 2026

Open Access

TL;DR

STMI is a multi-modal ReID framework that enhances foreground features with SAM-generated masks, adaptively reallocates semantic tokens without discarding any, and models high-order relationships across modalities with a hypergraph. It is reported to outperform existing methods on the RGBNT201, RGBNT100, and MSVR310 benchmarks.

Contribution

The paper presents STMI, a framework that combines Segmentation-Guided Feature Modulation (SFM), Semantic Token Reallocation (STR), and Cross-Modal Hypergraph Interaction (CHI). Together these modules aim to preserve discriminative cues that hard token filtering can lose and to model high-order relationships across modalities.

Findings

01. Outperforms existing methods on the RGBNT201, RGBNT100, and MSVR310 benchmarks.

02. Effectively suppresses background noise and enhances foreground features.

03. Demonstrates robustness and superior accuracy in multi-modal ReID tasks.

Abstract

Multi-modal object Re-Identification (ReID) aims to exploit complementary information from different modalities to retrieve specific objects. However, existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel multi-modal learning framework consisting of three key components: (1) a Segmentation-Guided Feature Modulation (SFM) module, which leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) a Semantic Token Reallocation (STR) module, which employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) a Cross-Modal Hypergraph Interaction (CHI) module, which constructs a…
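The first two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; the listing does not give the paper's equations. The additive log-mask bias, the single-head attention, the strength parameter `alpha`, and the function names `mask_biased_attention` and `query_pool` are all assumptions chosen for illustration. The sketch shows one plausible way a foreground mask could modulate attention, and how a small set of query tokens could summarize all patch tokens softly instead of filtering them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mask_biased_attention(tokens, fg_mask, alpha=1.0, eps=1e-6):
    """Self-attention whose logits are biased toward foreground keys.

    tokens:  (N, D) patch tokens.
    fg_mask: (N,) values in [0, 1], e.g. a segmentation mask pooled to the
             patch grid (assumed input, not produced here).
    alpha:   hypothetical modulation strength; it would be a learnable
             parameter in a trained model.
    """
    d = tokens.shape[-1]
    logits = tokens @ tokens.T / np.sqrt(d)        # (N, N)
    # Bias is about 0 for foreground keys and strongly negative for background.
    bias = alpha * np.log(fg_mask + eps)           # (N,)
    attn = softmax(logits + bias[None, :], axis=-1)
    return attn @ tokens, attn


def query_pool(tokens, queries):
    """Cross-attention from K query tokens to all N patch tokens.

    Every patch token receives a non-zero weight, so nothing is hard-dropped.
    """
    d = tokens.shape[-1]
    weights = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return weights @ tokens, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))      # 16 patch tokens, 8 dimensions
fg_mask = np.zeros(16)
fg_mask[:6] = 1.0                      # toy mask: first 6 patches are foreground
queries = rng.normal(size=(4, 8))      # 4 stand-in "learnable" queries

modulated, attn = mask_biased_attention(tokens, fg_mask, alpha=1.0)
_, plain_attn = mask_biased_attention(tokens, fg_mask, alpha=0.0)
pooled, weights = query_pool(modulated, queries)

# Attention mass that each row places on background keys, with and without bias.
bg_mass = attn[:, 6:].sum(axis=1)
plain_bg_mass = plain_attn[:, 6:].sum(axis=1)
```

Setting `alpha=0.0` recovers plain self-attention, which makes it easy to compare how much attention mass falls on background patches with and without the mask bias. The pooling step reduces 16 tokens to 4 compact vectors while every input token still contributes, which is the contrast with hard token filtering that the abstract draws.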

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection