TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

Chetwin Low; Weimin Wang

arXiv:2506.03099 · cs.SD · June 4, 2025

Open Access

TL;DR

TalkingMachines is a real-time, audio-driven video synthesis framework that turns pretrained video generation models into conversational character animators, streaming high-quality video driven by audio input.

Contribution

The paper adapts a pretrained image-to-video model into an audio-driven avatar generator and introduces efficient inference techniques for real-time performance.

Findings

01. Achieves real-time, high-quality audio-driven video synthesis

02. Enables infinite video streaming without error accumulation

03. Optimizes inference pipeline for low latency and high throughput

Abstract

In this paper, we present TalkingMachines -- an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) We adapt a pretrained SOTA image-to-video DiT into an audio-driven avatar generation model of 18 billion parameters; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations such as: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of…
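The difference between the bidirectional teacher and the sparse causal student in contribution (2) can be pictured as a difference in attention masks. The sketch below is only an illustration under stated assumptions: the abstract does not specify the sparsity pattern, so the chunk grouping, the `chunk_size` and `window_chunks` parameters, and all function names are hypothetical rather than the authors' design.

```python
# Illustrative only: the exact attention pattern used by TalkingMachines is
# not given in the abstract. Chunking and windowing here are assumptions.

def bidirectional_mask(num_frames):
    """Teacher-style mask: every frame may attend to every other frame."""
    return [[True] * num_frames for _ in range(num_frames)]


def sparse_causal_mask(num_frames, chunk_size, window_chunks):
    """Student-style mask (hypothetical pattern).

    Frames are grouped into chunks of `chunk_size`. A frame may attend to
    frames in its own chunk and in up to `window_chunks` preceding chunks,
    and never to frames in a later chunk.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            # Allowed when the key chunk is the same or recent, never future.
            row.append(0 <= q_chunk - k_chunk <= window_chunks)
        mask.append(row)
    return mask


def render(mask):
    """Draw a mask as text: 'x' means attention is allowed, '.' means blocked."""
    return "\n".join("".join("x" if m else "." for m in row) for row in mask)


if __name__ == "__main__":
    print(render(sparse_causal_mask(num_frames=6, chunk_size=2, window_chunks=1)))
```

With these toy settings, the last two frames see only the two most recent chunks. A bounded window of this kind is one way a causal model's per-step cost can stay constant while it streams indefinitely. How the real student bounds its context is not stated here.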

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Knowledge Distillation