Personalized Visual Instruction Tuning
Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang

TL;DR
This paper introduces PVIT, a framework for fine-tuning multimodal models to recognize and engage in personalized dialogues about specific individuals in images, addressing the 'face blindness' limitation.
Contribution
We propose a novel data curation and training pipeline for personalized visual instruction tuning and introduce P-Bench, a benchmark for evaluating personalized capabilities in MLLMs.
Findings
Significant improvement in personalized recognition and dialogue generation.
Effective data generation pipeline leveraging visual experts and large models.
Benchmark demonstrating enhanced personalized performance.
Abstract
Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as "face blindness". Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices, or domestic robots that need to recognize members of the family. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized…
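To make the personalized setting concrete, the sketch below shows one way a query about a named individual could be assembled: reference face images paired with names come first, followed by the scene image and the question. This is only an illustration under stated assumptions, not the paper's implementation. The `<image>` placeholder, the `PersonReference` class, the `build_personalized_prompt` function, and the file names are all hypothetical; the paper's actual special tokens and prompt template are not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image is spliced into the prompt.
# The special tokens actually used by PVIT are not specified in this summary.
IMAGE_TOKEN = "<image>"


@dataclass
class PersonReference:
    """A named individual plus one reference face image (illustrative only)."""
    name: str
    face_image_path: str


def build_personalized_prompt(
    people: List[PersonReference],
    scene_image_path: str,
    question: str,
) -> Tuple[str, List[str]]:
    """Return (prompt_text, image_paths).

    Reference faces are listed first, each introduced by name, and the
    scene image with the user's question comes last. The order of
    image_paths matches the order of IMAGE_TOKEN occurrences in the text.
    """
    lines: List[str] = []
    image_paths: List[str] = []

    # One in-context introduction per individual: face image + name.
    for person in people:
        lines.append(f"{IMAGE_TOKEN} This is {person.name}.")
        image_paths.append(person.face_image_path)

    # The scene image and the personalized question close the prompt.
    lines.append(f"{IMAGE_TOKEN} {question}")
    image_paths.append(scene_image_path)

    return "\n".join(lines), image_paths


if __name__ == "__main__":
    prompt, images = build_personalized_prompt(
        [
            PersonReference("Alice", "alice.jpg"),
            PersonReference("Bob", "bob.jpg"),
        ],
        scene_image_path="scene.jpg",
        question="What is Alice doing?",
    )
    print(prompt)
    print(images)
```

Because the individuals are supplied in the prompt itself, a layout of this kind requires no per-person fine-tuning at inference time, which matches how the reviews below characterize the approach.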
Peer Reviews
Decision: ICLR 2025 Poster
1. With the proposed Personalized Visual Instruction Tuning (PVIT), a general MLLM can conduct personalized conversations given a personalized context, without further tuning at test time. 2. The P-Bench benchmark, with around 1500 samples, could be used for related tasks. 3. Good qualitative and quantitative results. 4. The proposed method is easy to follow and the presentation is clear.
1. In other work such as Yo'LLaVA, personalization covers various types of subjects, while in this work it is limited to human faces, which differs from the claims made in the paper, for example in the introduction. 2. For the evaluation comparisons, many actual SOTA MLLMs, such as GPT-4, are not included. I tried those models and found that they perform pretty well on the task. For example, GPT-4o is correct on several examples in Figure 2. 3. No failure case analysis. I do not think such method cou
1. PVIT presents a novel approach to MLLM personalization by leveraging in-context learning capabilities, avoiding the need for additional training for each individual. 2. A benchmark is proposed for future research on this direction. 3. The paper is well-written and clearly explains the PVIT framework, data generation process, and experimental setup.
1. Some typos, e.g., Lines 191-192: two special tokens are mentioned, but only one is described. 2. The current implementation focuses primarily on names and faces. 3. The paper primarily focuses on individuals present in the training data. Some investigation of the zero-shot ability would help in understanding this framework's capability. 4. This problem is an interesting setting, but the proposed method is not that novel. 5. The current experiments are not sufficient; more experiments as described in
1. PVIT’s multi-phase pipeline is designed to generate personalized training data. By integrating visual expert models, image generation tools, and large language models, the authors ensure a rich and diverse dataset that enhances the personalization capabilities of MLLMs. 2. The authors contribute P-Bench, a novel benchmark designed to evaluate personalization capabilities in MLLMs. 3. Experiments demonstrate that PVIT improves MLLMs’ accuracy in handling personalized queries across various t
1. Reliance on Pretrained MLLMs and Potential Hallucinations: The pipeline relies heavily on pretrained multimodal large language models (MLLMs) for tasks such as image captioning and textual information generation. However, these models are known to sometimes produce hallucinated details. Without mechanisms to detect or filter out such inaccuracies, the generated data could introduce noise into the training process, leading to degraded model performance or unpredictable responses. To strengthen
Taxonomy
Topics: Video Analysis and Summarization · Intelligent Tutoring Systems and Adaptive Learning · Multimedia Communication and Technology
