Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

Yingjie Chen; Shilun Lin; Cai Xing; Binxin Yang; Long Zhou; Qixin Yan; Wenjing Wang; Dingming Liu; Hao Liu; Chen Li; Jing Lyu

arXiv:2603.17889 · cs.CV · March 26, 2026

TL;DR

This paper introduces a scalable framework for personalized joint audio-video synthesis that allows fine-grained control of facial appearance and voice timbre across multiple identities, advancing identity-aware content creation.

Contribution

It presents a novel data curation pipeline, a flexible identity injection mechanism, and a multi-stage training strategy for high-fidelity, multi-identity audio-video generation.

Findings

01 Outperforms existing methods in identity consistency and quality.

02 Supports multi-subject interactions with personalized control.

03 Enforces cross-modal coherence effectively.

Abstract

Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance…

Code & Models

Models

echoanran/Identity-as-Presence (Hugging Face model · 8 downloads)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications