MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Ruichuan An; Sihan Yang; Renrui Zhang; Ming Lu; Tianyi Jiang; Kai Zeng; Yulin Luo; Jiajun Cao; Hao Liang; Ying Chen; Qi She; Shanghang Zhang; Wentao Zhang

arXiv:2411.11706 · cs.CV · February 19, 2026 · 2 cites

Open Access · 1 Repo

TL;DR

MC-LLaVA introduces a multi-concept personalization approach for vision-language models, enabling them to understand and respond to multiple user-defined concepts simultaneously, with new training strategies, prompts, and a specialized dataset.

Contribution

It presents a novel multi-concept instruction tuning method, personalized prompts, and a high-quality dataset for improved VLM personalization.

Findings

01. Achieves impressive multi-concept personalized responses
02. Reduces training costs with visual token-based prompts
03. Enhances recognition and grounding with visual prompts

Abstract

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To…
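The abstract states that concept tokens are initialized from visual token information but does not say how. As a rough illustration only, the sketch below assumes one plausible reading: cluster the visual tokens of a concept's reference images and use the cluster centroids as starting values for that concept's tokens. The function name `init_concept_tokens`, the choice of k-means, and the deterministic seeding are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_concept_tokens(visual_tokens: List[Vector], k: int, iters: int = 10) -> List[Vector]:
    """Hypothetical initializer: return k centroids of the visual tokens.

    Assumption: plain k-means stands in for whatever procedure the
    authors actually use to turn visual tokens into concept tokens.
    """
    if not 1 <= k <= len(visual_tokens):
        raise ValueError("k must be between 1 and the number of visual tokens")
    # Deterministic start: take k evenly spaced tokens as initial centroids.
    step = len(visual_tokens) // k
    centroids = [list(visual_tokens[i * step]) for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for tok in visual_tokens:
            nearest = min(range(k), key=lambda j: squared_distance(tok, centroids[j]))
            clusters[nearest].append(tok)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_vector(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids


if __name__ == "__main__":
    # Toy 2-D "visual tokens" forming two well-separated groups.
    tokens = [[0, 0], [0, 2], [10, 10], [10, 12]]
    print(init_concept_tokens(tokens, k=2))
```

In a real pipeline the inputs would be high-dimensional features from the model's vision encoder, and the centroids would be copied into learnable embeddings before instruction tuning. Starting from image-derived values rather than random ones is the kind of shortcut the abstract credits with reducing training cost.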

Code & Models

Repositories

arctanxarc/mc-llava (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Geographic Information Systems Studies

Methods: Focus