TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

Xiangyu Liu; Feng Gao; Xiaomei Zhang; Yong Zhang; Xiaoming Wei; Zhen Lei; Xiangyu Zhu

arXiv:2604.14580 · cs.CV · May 7, 2026

TL;DR

TurboTalk is a novel two-stage distillation framework that compresses multi-step audio-driven video diffusion models into a single-step generator, significantly accelerating inference while maintaining quality.

Contribution

TurboTalk introduces a progressive distillation approach, paired with a stable training strategy, that enables one-step audio-driven video avatar generation.
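TurboTalk's two stages fix the endpoints of the schedule (Distribution Matching Distillation, DMD, to a 4-step student, then adversarial distillation down to 1 step) but not the intermediate step counts or timesteps. The sketch below assumes one step is removed per adversarial stage (4→3→2→1) and that a k-step student evaluates evenly spaced timesteps on a hypothetical 1000-step training grid. `student_timesteps` and `progressive_schedule` are illustrative names, not the authors' code.

```python
from typing import List, Optional, Tuple

# A stage is (method, source step count or None for the multi-step teacher, target step count).
Stage = Tuple[str, Optional[int], int]


def student_timesteps(num_steps: int, num_train_timesteps: int = 1000) -> List[int]:
    """Evenly spaced timesteps, highest noise first, for a `num_steps` student (assumed grid)."""
    stride = num_train_timesteps // num_steps
    return [num_train_timesteps - 1 - i * stride for i in range(num_steps)]


def progressive_schedule(dmd_steps: int = 4, final_steps: int = 1) -> List[Stage]:
    """Stage 1: DMD from the teacher. Stage 2: drop one step at a time (assumption)."""
    stages: List[Stage] = [("dmd", None, dmd_steps)]
    for k in range(dmd_steps, final_steps, -1):
        stages.append(("adversarial", k, k - 1))
    return stages


if __name__ == "__main__":
    for method, src, dst in progressive_schedule():
        source = "teacher" if src is None else f"{src}-step student"
        print(f"{method:>11}: {source} -> {dst}-step student, t = {student_timesteps(dst)}")
```

Only the 4-step and 1-step endpoints come from the paper; whether the authors remove one step per stage or jump (for example 4→2→1), and which timesteps each student uses, is not stated in this summary.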

Findings

1. Achieves a 120x inference speedup.
2. Maintains high-quality video avatar generation.
3. Combines progressive distillation with training-stabilization techniques: progressive timestep sampling and a self-compare adversarial objective.
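A headline figure like the 120x above depends on how many denoiser forward passes (NFEs) the baseline spends per video and on fixed costs that step distillation does not remove, such as audio feature extraction and VAE decoding. The cost model below is a back-of-the-envelope sketch; every step count and cost in it is hypothetical and is not the paper's configuration.

```python
def end_to_end_speedup(
    teacher_steps: int,
    student_steps: int,
    teacher_nfe_per_step: int = 1,
    student_nfe_per_step: int = 1,
    fixed_cost: float = 0.0,
) -> float:
    """Ratio of teacher to student latency, in units of one denoiser forward pass.

    `fixed_cost` models per-video work outside the denoising loop
    (e.g. audio encoding, VAE decoding), expressed in forward-pass units.
    """
    teacher_cost = teacher_steps * teacher_nfe_per_step + fixed_cost
    student_cost = student_steps * student_nfe_per_step + fixed_cost
    return teacher_cost / student_cost


if __name__ == "__main__":
    # Hypothetical baseline: 40 steps with classifier-free guidance (2 passes per step),
    # versus a one-step student that needs a single pass.
    print(end_to_end_speedup(40, 1, teacher_nfe_per_step=2))  # 80.0
    print(round(end_to_end_speedup(40, 1, teacher_nfe_per_step=2, fixed_cost=0.5), 2))  # 53.67
```

The second call shows why fixed costs matter: once the denoising loop shrinks to a single pass, overhead outside the loop makes up a much larger share of what remains, so per-step and end-to-end speedups can differ substantially.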

Abstract

Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate…
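Distribution Matching Distillation updates the student using the gap between two scores evaluated on noised student outputs: a "fake" score that tracks the student's own output distribution and a "real" score from the frozen teacher. The toy below shows that update in one dimension, assuming a Gaussian target N(mu_real, 1), a generator G(z) = z + theta, and variance-exploding noise, so both scores have closed forms. In actual DMD the fake score is a network trained online and the real score is the teacher diffusion model; the toy also uses a one-step generator for brevity, whereas TurboTalk applies DMD to obtain a 4-step student. All names are illustrative.

```python
import random


def real_score(x_t: float, sigma: float, mu_real: float) -> float:
    """Score of the noised target N(mu_real, 1 + sigma^2); stands in for the teacher."""
    return -(x_t - mu_real) / (1.0 + sigma ** 2)


def fake_score(x_t: float, sigma: float, theta: float) -> float:
    """Score of the noised student distribution N(theta, 1 + sigma^2).

    In DMD this is a separately trained network; in this toy it is exact.
    """
    return -(x_t - theta) / (1.0 + sigma ** 2)


def dmd_step(theta: float, mu_real: float, rng: random.Random,
             lr: float = 0.5, batch: int = 64) -> float:
    """One gradient step on KL(p_fake || p_real) via the DMD score-difference estimator."""
    grad = 0.0
    for _ in range(batch):
        x = rng.gauss(0.0, 1.0) + theta          # one-step generator G_theta(z) = z + theta
        sigma = rng.uniform(0.1, 2.0)            # random noise level
        x_t = x + sigma * rng.gauss(0.0, 1.0)    # variance-exploding forward noising
        # d x_t / d theta = 1, so each sample contributes (s_fake - s_real).
        grad += fake_score(x_t, sigma, theta) - real_score(x_t, sigma, mu_real)
    return theta - lr * grad / batch


if __name__ == "__main__":
    rng = random.Random(0)
    theta, mu_real = 0.0, 3.0
    for _ in range(50):
        theta = dmd_step(theta, mu_real, rng)
    print(f"theta after 50 steps: {theta:.4f}")  # approaches mu_real = 3.0
```

In a real setup the fake-score network is updated between generator steps so that it keeps tracking the student's outputs; the toy skips that step because its fake score is exact.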
