Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation
Yingjie Chen, Shilun Lin, Cai Xing, Binxin Yang, Long Zhou, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu

TL;DR
This paper introduces a scalable framework for personalized joint audio-video synthesis that allows fine-grained control of facial appearance and voice timbre across multiple identities, advancing identity-aware content creation.
Contribution
It presents a novel data curation pipeline, a flexible identity injection mechanism, and a multi-stage training strategy for high-fidelity, multi-identity audio-video generation.
Findings
Outperforms existing methods in identity consistency and quality.
Supports multi-subject interactions with personalized control.
Enforces cross-modal coherence effectively.
Abstract
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications
