MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt
Yuhao Wang, Xuehu Liu, Tianyu Yan, Yang Liu, Aihua Zheng, Pingping Zhang, and Huchuan Lu

TL;DR
MambaPro is a multi-modal object ReID framework that adapts the large-scale pre-trained CLIP model with a parallel adapter, synergistic prompts, and Mamba-based aggregation, yielding more robust features and improved performance on three benchmarks.
Contribution
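The paper proposes MambaPro, a framework that adapts CLIP to multi-modal ReID using a Parallel Feed-Forward Adapter (PFA), a Synergistic Residual Prompt (SRP), and Mamba Aggregation. It targets the difficulty existing aggregation methods have with long multi-modal sequences and aims to make the extracted features more robust.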
Findings
Outperforms existing methods on three benchmarks.
Efficiently models interactions between modalities.
Extracts more robust features with lower complexity.
Abstract
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address the above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. Specifically, we first employ a Parallel Feed-Forward Adapter (PFA) to adapt CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba…
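Sketch: parallel adapter (illustrative only)
The abstract names a Parallel Feed-Forward Adapter but does not describe its internals here. The sketch below shows the generic parallel-adapter pattern that the name suggests: a small trainable bottleneck branch runs alongside a frozen feed-forward layer and its output is added to the result. This is not the authors' implementation. The class name ParallelAdapterFFN, the layer sizes, the ReLU activation, the scale factor, and the zero-initialized up-projection are all assumptions made for illustration, and the "frozen" weights are random stand-ins rather than CLIP weights.

```python
import random


def linear(x, w, b):
    """Apply y = x @ w + b to one token vector x (a list of floats)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def relu(x):
    """Element-wise ReLU (a stand-in for the backbone's real activation)."""
    return [max(0.0, v) for v in x]


def init_matrix(rows, cols, limit, rng):
    """Small random matrix with entries drawn uniformly from [-limit, limit]."""
    return [[rng.uniform(-limit, limit) for _ in range(cols)]
            for _ in range(rows)]


class ParallelAdapterFFN:
    """Frozen FFN plus a trainable bottleneck branch in parallel (hypothetical)."""

    def __init__(self, dim, hidden, bottleneck, scale=1.0, seed=0):
        rng = random.Random(seed)
        # Stand-ins for the frozen, pre-trained feed-forward weights.
        self.w1 = init_matrix(dim, hidden, 0.1, rng)
        self.b1 = [0.0] * hidden
        self.w2 = init_matrix(hidden, dim, 0.1, rng)
        self.b2 = [0.0] * dim
        # Adapter weights: the only parameters that would be trained.
        self.down = init_matrix(dim, bottleneck, 0.1, rng)
        self.bd = [0.0] * bottleneck
        # Assumption: zero-init the up-projection so the adapter starts as a no-op.
        self.up = [[0.0] * dim for _ in range(bottleneck)]
        self.bu = [0.0] * dim
        self.scale = scale

    def ffn(self, x):
        """Frozen feed-forward path."""
        return linear(relu(linear(x, self.w1, self.b1)), self.w2, self.b2)

    def adapter(self, x):
        """Trainable down-project, activate, up-project path."""
        return linear(relu(linear(x, self.down, self.bd)), self.up, self.bu)

    def forward(self, x):
        """Residual input + frozen FFN output + scaled adapter output."""
        f, a = self.ffn(x), self.adapter(x)
        return [xi + fi + self.scale * ai for xi, fi, ai in zip(x, f, a)]


if __name__ == "__main__":
    block = ParallelAdapterFFN(dim=4, hidden=8, bottleneck=2)
    print(block.forward([0.5, -1.0, 2.0, 0.25]))
```

With the zero-initialized up-projection, the block initially reproduces the frozen layer's output exactly, so fine-tuning starts from the pre-trained behavior and only the small adapter branch needs to be updated.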
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Contrastive Language-Image Pre-training · Adapter
