A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
Julius Broomfield, Kartik Sharma, Srijan Kumar

TL;DR
This study investigates how different modalities like text, images, and stylized text influence the ability of multimodal large language models to embody diverse personas, revealing modality-specific strengths and limitations.
Contribution
The paper introduces a novel dataset and evaluation framework to systematically analyze the impact of various modalities on persona embodiment in multimodal LLMs.
Findings
Text-based personas exhibit more linguistic habits.
Typographical images show higher consistency with personas.
LLMs often overlook image-conveyed persona details.
Abstract
Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. Concurrently, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and image, the extent to which the modality of a persona impacts its embodiment by the LLM remains largely unexplored. In this paper, we investigate how different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. The dataset uses four modalities to represent each persona equivalently: image-only, text-only, a combination of image and small text, and typographical images, where text is…
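As a rough illustration of what "modality-parallel" could mean in practice, the sketch below stores one persona as four equivalent views, one per modality named in the abstract. The class names, field names, and the `build_views` helper are hypothetical and do not come from the paper or its released data; the file paths are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Modality(Enum):
    # The four modalities listed in the abstract.
    TEXT = "text-only"
    IMAGE = "image-only"
    IMAGE_TEXT = "image and small text"
    TYPOGRAPHIC = "typographical image"


@dataclass(frozen=True)
class PersonaView:
    # Hypothetical record: one persona rendered in one modality.
    persona_id: int
    modality: Modality
    text: Optional[str] = None        # set when the view carries plain text
    image_path: Optional[str] = None  # set when the view carries an image


def build_views(persona_id: int, description: str, image_path: str,
                caption: str, typographic_path: str) -> List[PersonaView]:
    """Return the four parallel views of a single persona (assumed layout)."""
    return [
        PersonaView(persona_id, Modality.TEXT, text=description),
        PersonaView(persona_id, Modality.IMAGE, image_path=image_path),
        PersonaView(persona_id, Modality.IMAGE_TEXT,
                    text=caption, image_path=image_path),
        PersonaView(persona_id, Modality.TYPOGRAPHIC,
                    image_path=typographic_path),
    ]
```

Keeping all four views under the same `persona_id` is one way to make it possible to compare a model's responses across modalities for the same persona.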
Taxonomy
Topics: Persona Design and Applications · Multimodal Machine Learning Applications · Topic Modeling
