LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Ethan Chern; Zhulin Hu; Bohao Tang; Jiadi Su; Steffi Chern; Zhijie Deng; Pengfei Liu

arXiv:2512.23576·cs.CV·December 30, 2025

TL;DR

This paper introduces LiveTalk, a real-time multimodal interactive video diffusion system that achieves high-quality, low-latency video generation conditioned on text, image, and audio, enabling seamless human-AI interaction.

Contribution

The paper proposes an improved on-policy distillation method for multimodal video diffusion, significantly reducing inference latency while maintaining visual quality and enhancing multi-turn interaction capabilities.

Findings

01. Matches the visual quality of larger models at 20x lower inference cost

02. Reduces response latency from 1-2 minutes to real time

03. Outperforms state-of-the-art models in multi-turn video coherence

Abstract

Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved…

Code & Models

Models

GAIR/LiveTalk-1.3B-V0.1 (Hugging Face model · 32 downloads · 15 likes)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning