LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu

TL;DR
This paper introduces LiveTalk, a real-time multimodal interactive video diffusion system that achieves high-quality, low-latency video generation conditioned on text, image, and audio, enabling seamless human-AI interaction.
Contribution
The paper proposes an improved on-policy distillation method for multimodal video diffusion, significantly reducing inference latency while maintaining visual quality and enhancing multi-turn interaction capabilities.
Findings
Matches the visual quality of larger models at 20x lower inference cost
Reduces response latency from 1-2 minutes to real-time
Outperforms state-of-the-art models in multi-turn video coherence
Abstract
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved…
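The abstract's point about bidirectional attention can be illustrated with a toy attention mask. This is a minimal sketch, not the paper's implementation: the function names, the frame count, and the block size are hypothetical and chosen only for illustration. It shows that with full bidirectional attention the first frames depend on every frame in the clip, whereas with a block-causal (autoregressive) mask the first block depends only on itself and can be emitted before later frames exist.

```python
# Toy illustration only: all names and sizes here are hypothetical assumptions,
# not taken from LiveTalk or Self Forcing.

def bidirectional_mask(num_frames):
    """Every frame attends to every other frame (all frames denoised jointly)."""
    return [[True] * num_frames for _ in range(num_frames)]

def block_causal_mask(num_frames, block_size):
    """Frame i attends to frame j only if j's block is not later than i's block."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]

def frames_needed_for_first_block(mask, block_size):
    """Count the frames the first block attends to, i.e. how many frames
    must be processed together before the first block can be produced."""
    first_rows = mask[:block_size]
    return sum(any(row[j] for row in first_rows) for j in range(len(mask)))

if __name__ == "__main__":
    num_frames, block_size = 6, 2
    full = bidirectional_mask(num_frames)
    causal = block_causal_mask(num_frames, block_size)
    print(frames_needed_for_first_block(full, block_size))    # prints 6
    print(frames_needed_for_first_block(causal, block_size))  # prints 2
```

Under these assumptions, the bidirectional model cannot show anything until the whole clip is denoised, while the block-causal model can stream its first block immediately. Reducing the number of sampling steps per block, as distillation aims to do, is a separate saving that this sketch does not model.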
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
