NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Shihao Li; Huaibo Huang; Junxian Duan; Aihua Zheng; Jin Tang; Jixin Ma

arXiv:2505.20001·cs.CV·May 12, 2026

TL;DR

This paper introduces NEXT, a multi-grained mixture of experts framework utilizing text modulation and attribute-based captioning to enhance multi-modal object re-identification accuracy across diverse datasets.

Contribution

The paper proposes a novel multi-grained expert framework with text-modulation and attribute confidence for improved multi-modal object ReID, addressing limitations of implicit feature fusion.

Findings

01 Outperforms state-of-the-art methods on multiple datasets.

02 Effectively models both fine-grained and coarse-grained features.

03 Reduces the unknown recognition rate of MLLMs via attribute-based captioning.

Abstract

Multi-modal object Re-IDentification (ReID) aims to obtain complete identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under the various challenges of the real world. Powerful Multi-modal Large Language Models (MLLMs) can now effectively translate object appearances into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of the generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into…
