Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Zhen Ye; Xu Tan; Aoxiong Yin; Hongzhan Lin; Guangyan Zhang; Peiwen Sun; Yiming Li; Chi-Min Chan; Wei Ye; Shikun Zhang; Wei Xue

arXiv:2604.23586 · cs.CV · April 28, 2026

TL;DR

Talker-T2AV introduces a joint autoregressive diffusion model for talking head synthesis that separates high-level semantic modeling from low-level detail refinement to improve cross-modal coherence and efficiency.

Contribution

The paper proposes an autoregressive diffusion framework that performs high-level cross-modal modeling in a shared backbone and low-level refinement in modality-specific decoders for talking head generation.

Findings

01. Outperforms dual-branch baselines in lip-sync accuracy.

02. Achieves higher video and audio quality.

03. Demonstrates stronger cross-modal consistency.

Abstract

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified…
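The split described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: every name here (`shared_backbone`, `make_decoder`, `generate`) is hypothetical, the latents are single floats, and the refinement loop is a simple stand-in for a diffusion sampler. The only point it makes is structural: audio and video interact solely through the shared autoregressive step, while each modality is refined by its own decoder.

```python
import random


def shared_backbone(history):
    """Hypothetical stand-in for the shared autoregressive model.

    Maps the history of joint (audio, video) latents to one conditioning
    value. This is the only place where the two modalities interact.
    """
    if not history:
        return 0.0
    flat = [value for pair in history for value in pair]
    return sum(flat) / len(flat) + 1.0  # arbitrary toy update rule


def make_decoder(scale):
    """Hypothetical modality-specific decoder.

    Starts from noise and iteratively moves toward a target derived from
    the shared condition. A toy refinement loop, not a real diffusion sampler.
    """
    def decode(cond, rng, steps=8):
        x = rng.gauss(0.0, 1.0)          # start from Gaussian noise
        target = scale * cond            # modality-specific mapping (assumed)
        for _ in range(steps):
            x = x + 0.5 * (target - x)   # halve the distance to the target
        return x
    return decode


audio_decoder = make_decoder(scale=1.0)
video_decoder = make_decoder(scale=-1.0)


def generate(num_steps, seed=0):
    """Autoregressively produce a list of (audio, video) latent pairs."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_steps):
        cond = shared_backbone(history)    # high-level, cross-modal
        audio = audio_decoder(cond, rng)   # low-level, audio only
        video = video_decoder(cond, rng)   # low-level, video only
        history.append((audio, video))
    return history
```

Calling `generate(4)` returns four `(audio, video)` pairs. Neither decoder ever sees the other's output directly; it only sees the condition produced by the shared step.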
