MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang

TL;DR
MC-LLaVA introduces a multi-concept personalization approach for vision-language models, enabling them to understand and respond to multiple user-defined concepts simultaneously, with new training strategies, prompts, and a specialized dataset.
Contribution
It presents a novel multi-concept instruction tuning method, personalized prompts, and a high-quality dataset for improved VLM personalization.
Findings
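Generates personalized responses that involve multiple user-defined concepts at once
Reduces training costs by initializing concept tokens from visual token information
Improves recognition and grounding by aggregating location maps into a personalized visual prompt at inference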
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To…
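The abstract says concept tokens are initialized from visual token information but does not say how. The following is a minimal sketch of one plausible approach, not the authors' implementation. It assumes each concept comes with a handful of patch feature vectors already projected into the language model's embedding space, and it assumes k-means centroids are used as the initial concept token embeddings. The function names `init_concept_tokens` and `init_multi_concept` are hypothetical.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def init_concept_tokens(visual_tokens, num_tokens, num_iters=10, seed=0):
    """Return `num_tokens` initial embeddings for one concept.

    visual_tokens: list of feature vectors (assumed to be patch features
    already mapped into the language model's embedding space).
    The k-means clustering used here is an assumption for illustration.
    """
    if not 0 < num_tokens <= len(visual_tokens):
        raise ValueError("num_tokens must be between 1 and len(visual_tokens)")
    rng = random.Random(seed)
    # Start from randomly chosen visual tokens.
    centroids = [list(v) for v in rng.sample(visual_tokens, num_tokens)]
    for _ in range(num_iters):
        clusters = [[] for _ in range(num_tokens)]
        for v in visual_tokens:
            nearest = min(range(num_tokens),
                          key=lambda k: _sq_dist(v, centroids[k]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [_mean(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    return centroids


def init_multi_concept(concept_features, tokens_per_concept):
    """Initialize tokens for several concepts; keys are concept names."""
    return {name: init_concept_tokens(feats, tokens_per_concept)
            for name, feats in concept_features.items()}
```

In a real pipeline the returned vectors would be copied into the embedding rows of newly added concept tokens before instruction tuning, so that training starts from a visually informed point rather than from random values.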
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Geographic Information Systems Studies
