UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Hebeizi Li; Zihao Liang; Benyuan Sun; Zihao Yin; Xiao Sha; Chenliang Wang; Yi Yang

arXiv:2603.01418 · cs.CV · March 3, 2026

TL;DR

UniTalking is an accessible, end-to-end diffusion framework that generates high-quality, lip-synced talking portraits with personalized voice cloning, outperforming existing open-source models in realism and accuracy.

Contribution

The paper introduces a unified, end-to-end diffusion model that uses multi-modal transformer blocks for high-fidelity talking portrait generation and personalized voice cloning.

Findings

01 Achieves superior lip-sync accuracy and visual fidelity.

02 Demonstrates effective voice cloning from brief audio references.

03 Outperforms existing open-source methods in quality and realism.

Abstract

While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio…
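The shared self-attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single attention head, the per-modality query/key/value projections, the token counts, and every name (`joint_self_attention`, `w_audio`, `w_video`, `project`) are assumptions made for illustration, and plain Python lists stand in for latent tensors. What it shows is that audio and video tokens are concatenated into one sequence, so every token attends to both modalities within a single softmax.

```python
import math
import random


def project(tokens, weight):
    """Multiply each token (a list of floats) by a dim x dim weight matrix."""
    return [[sum(t[k] * weight[k][j] for k in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(audio, video, w_audio, w_video):
    """Single-head attention over the concatenated audio + video sequence.

    w_audio and w_video are hypothetical per-modality projection sets,
    each a dict with "q", "k" and "v" matrices (an assumed design choice).
    """
    # Project each modality with its own weights, then concatenate.
    q = project(audio, w_audio["q"]) + project(video, w_video["q"])
    k = project(audio, w_audio["k"]) + project(video, w_video["k"])
    v = project(audio, w_audio["v"]) + project(video, w_video["v"])

    dim = len(q[0])
    outputs, attn = [], []
    for query in q:
        # Scaled dot-product scores against every audio AND video key.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in k]
        weights = softmax(scores)
        attn.append(weights)
        outputs.append([sum(weights[j] * v[j][d] for j in range(len(v)))
                        for d in range(dim)])

    # Split the fused sequence back into its two modality streams.
    n_audio = len(audio)
    return outputs[:n_audio], outputs[n_audio:], attn


def random_matrix(rng, rows, cols):
    """Small random matrix used as stand-in data and weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
audio = random_matrix(rng, 3, dim)   # 3 hypothetical audio latent tokens
video = random_matrix(rng, 5, dim)   # 5 hypothetical video latent tokens
w_audio = {name: random_matrix(rng, dim, dim) for name in "qkv"}
w_video = {name: random_matrix(rng, dim, dim) for name in "qkv"}

audio_out, video_out, attn = joint_self_attention(audio, video, w_audio, w_video)
```

Each row of `attn` has one weight per audio token and one per video token, so in this sketch a video token can draw directly on audio tokens (and vice versa) inside the same layer. That is the kind of pathway through which fine-grained audio-video correspondence could be modeled.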

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis