PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

Fatemeh Nazarieh; Zhenhua Feng; Diptesh Kanojia; Muhammad Awais; Josef Kittler

arXiv:2412.07754 · cs.CV · October 2, 2025

Open Access

TL;DR

PortraitTalk is a customizable one-shot audio-to-talking face generation framework that emphasizes visual quality, personalization, and generalization. It is built on a latent diffusion model and uses decoupled cross-attention as its control mechanism.

Contribution

The paper introduces a diffusion-based approach that uses decoupled cross-attention for finer control over the output and reduces reliance on reference videos, with the aim of more realistic talking face synthesis.
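As a rough illustration of what decoupled cross-attention generally means, the sketch below gives each conditioning signal its own attention branch and sums the branch outputs. It is a minimal pure-Python sketch under stated assumptions, not the paper's implementation: the function names, the `scale_b` weight, and the choice of which signals feed each branch are all hypothetical, and the raw condition features stand in for the learned key and value projections a real model would have.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def decoupled_cross_attention(latent, cond_a, cond_b, scale_b=1.0):
    """Attend to two conditions in separate branches, then sum the results.

    Assumption: a trained model would apply separate learned key/value
    projections per branch; here the raw features are used as keys and values.
    """
    out_a = attention(latent, cond_a, cond_a)
    out_b = attention(latent, cond_b, cond_b)
    return [
        [a + scale_b * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(out_a, out_b)
    ]


# Hypothetical toy inputs: two latent tokens, one token per condition.
latent = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 3.0]]
identity_tokens = [[10.0, 20.0]]
result = decoupled_cross_attention(latent, text_tokens, identity_tokens, scale_b=0.5)
```

Because the branches are kept separate, the influence of one condition can be scaled (here through `scale_b`) without changing how the other is attended to, which is the usual motivation for decoupling.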

Findings

01. Outperforms state-of-the-art methods in quality and control
02. Incorporates text prompts for creative customization
03. Develops a new evaluation metric for talking face generation

Abstract

Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing

Methods: Diffusion