TL;DR
This paper introduces a new task, Customized Multimodal Role-Play, and a unified model that enables personalized, consistent human-AI interactions across text and images using minimal data.
Contribution
It proposes the CMRP task, constructs the RoleScape-20 dataset, and develops UniCharacter, a two-stage training framework for few-shot multimodal character customization.
Findings
The method outperforms prior approaches on RoleScape-20.
Coherent persona, style, and visual identity are achieved with only 10 images and their corresponding interaction examples.
Cross-modal consistency and few-shot strategies are validated through ablation studies.
Abstract
Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To address this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style,…
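For readers unfamiliar with group relative policy optimization, the following is a minimal sketch of the group-relative advantage step that standard GRPO is built on: several responses are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. It is not the paper's Character-GRPO. The function name, the `eps` constant, and the example reward values are assumptions made purely for illustration, and the character-specific rewards used by the authors are not described in this excerpt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scores for several responses sampled for the SAME prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four sampled responses to one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.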
