TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

Soumya Mazumdar; Vineet Kumar Rakesh

arXiv:2603.06057·cs.CV·March 9, 2026

Open Access

TL;DR

TempoSyncDiff is a distilled, temporally consistent diffusion framework for audio-driven talking-head generation. It targets low-latency inference, reduced flicker and identity drift, and deployment on edge devices.

Contribution

The paper presents a novel reference-conditioned latent diffusion model with teacher-student distillation, identity anchoring, and temporal regularization for efficient, stable, and realistic talking-head synthesis.

Findings

01. Achieves lower-latency inference than standard diffusion models while retaining quality.

02. Reduces flicker and identity drift through temporal regularization and identity anchoring.

03. Demonstrates the feasibility of edge deployment through CPU-only and edge computing measurements.

Abstract

Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during…
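The training signal described in the abstract combines three ideas: a student matching a teacher's noise prediction, a temporal term against flicker, and an identity term against drift. The sketch below shows one way such terms could be combined. It is not the authors' formulation: the function names, the loss weights `lambda_temporal` and `lambda_identity`, the plain frame-difference penalty, and the cosine-distance identity term are all assumptions made for illustration.

```python
import numpy as np


def distillation_loss(student_eps, teacher_eps):
    """Mean squared error between student and teacher noise predictions."""
    return float(np.mean((student_eps - teacher_eps) ** 2))


def temporal_loss(frames):
    """Mean squared difference between consecutive frames, shape (T, ...).

    Assumption: a plain smoothness penalty. It also penalizes genuine
    motion, so a real system would likely use something more selective.
    """
    diffs = frames[1:] - frames[:-1]
    return float(np.mean(diffs ** 2))


def identity_loss(frame_embeddings, reference_embedding):
    """Mean cosine distance between frame embeddings (T, D) and a reference (D,).

    Assumption: embeddings come from some face-identity encoder that is
    not specified here.
    """
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    r = reference_embedding / np.linalg.norm(reference_embedding)
    return float(np.mean(1.0 - f @ r))


def total_loss(student_eps, teacher_eps, frames, frame_embeddings,
               reference_embedding, lambda_temporal=0.1, lambda_identity=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        distillation_loss(student_eps, teacher_eps)
        + lambda_temporal * temporal_loss(frames)
        + lambda_identity * identity_loss(frame_embeddings, reference_embedding)
    )
```

In this sketch the distillation term is zero when the student reproduces the teacher exactly, and the two regularizers are zero for a static clip whose frame embeddings match the reference. How the terms are actually defined and weighted in TempoSyncDiff is not stated in the excerpt above.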

Taxonomy

Topics: Music Technology and Sound Studies · Face recognition and analysis · Hearing Loss and Rehabilitation