MLLMReID: Multimodal Large Language Model-based Person Re-identification

Shan Yang; Yongfei Zhang

arXiv:2401.13201·cs.CV·June 11, 2024·2 cites

Open Access

TL;DR

This paper introduces MLLMReID, an approach that adapts multimodal large language models (MLLMs) to person re-identification (ReID) by using a common instruction and a multi-task learning synchronization module, and it reports higher ReID accuracy than existing methods.

Contribution

It proposes a common instruction, which keeps instruction design simple and limits overfitting to specific instructions, and a multi-task learning synchronization module, which trains the MLLM's visual encoder in step with the ReID task.

Findings

01. MLLMReID outperforms existing methods in ReID accuracy.

02. The common instruction approach simplifies instruction design.

03. Synchronization improves visual encoder training effectiveness.

Abstract

Multimodal large language models (MLLMs) have achieved satisfactory results in many tasks. However, their performance on person re-identification (ReID) has not been explored to date. This paper investigates how to adapt them for ReID. An intuitive idea is to fine-tune an MLLM on ReID image-text datasets and then use its visual encoder as a backbone for ReID. However, two issues remain: (1) When designing instructions for ReID, MLLMs may overfit to specific instructions, and designing a wide variety of instructions increases cost. (2) When the visual encoder of an MLLM is fine-tuned, it is not trained synchronously with the ReID task. As a result, the effectiveness of the visual encoder fine-tuning is not directly reflected in ReID performance. To address these problems, this paper proposes MLLMReID: Multimodal Large Language Model-based ReID.…
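
To make "use their visual encoder as a backbone for ReID" more concrete, the following is a minimal sketch of how a ReID backbone is typically used at retrieval time: each image is mapped to a feature vector, and gallery images are ranked by cosine similarity to the query. It is not taken from the paper. The function names and the toy feature values are assumptions for illustration, and the hard-coded vectors stand in for the output of a fine-tuned visual encoder.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_feat, g) for g in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for visual-encoder outputs (made-up values).
query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],  # hypothetical image of a different identity
    [1.0, 0.2, 0.1],  # hypothetical image of the same identity as the query
    [0.1, 0.0, 1.0],  # hypothetical image of a different identity
]

ranking = rank_gallery(query, gallery)
print(ranking)
```

In this setting, a better-trained visual encoder places images of the same person closer together in feature space, which is why the abstract stresses that encoder fine-tuning should be reflected directly in ReID performance.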

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Multimodal Machine Learning Applications