LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models

Haojie Yu; Zhaonian Wang; Yihan Pan; Meng Cheng; Hao Yang; Chao Wang; Tao Xie; Xiaoming Xu; Xiaoming Wei; Xunliang Cai

arXiv:2506.05806 · cs.CV · June 9, 2025

Open Access

TL;DR

This paper introduces LLIA, a diffusion-based framework for real-time, low-latency, audio-driven avatar video generation that supports high-fidelity, expressive, and seamless two-way communication.

Contribution

The paper presents novel techniques for variable-length video generation, a consistency training strategy, and inference optimizations enabling real-time performance of diffusion models for avatars.

Findings

01. Achieves up to 78 FPS at 384x384 resolution.

02. Initial video latency is reduced to 140 ms.

03. Supports seamless switching between speaking, listening, and idle states.
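
To put the reported throughput in perspective, the short sketch below converts the 78 FPS figure into an average per-frame time budget and a real-time factor. The 78 FPS and 140 ms values come from the findings listed here; the 25 FPS playback rate and the helper names are assumptions made only for illustration and are not taken from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Average time available per generated frame, in milliseconds."""
    return 1000.0 / fps


def real_time_factor(gen_fps: float, playback_fps: float) -> float:
    """How many times faster than playback the generator runs."""
    return gen_fps / playback_fps


GEN_FPS = 78.0                 # reported peak throughput at 384x384
FIRST_CLIP_LATENCY_MS = 140.0  # reported initial video latency
PLAYBACK_FPS = 25.0            # ASSUMPTION: a typical video playback rate

budget = frame_budget_ms(GEN_FPS)
rtf = real_time_factor(GEN_FPS, PLAYBACK_FPS)

print(f"Per-frame budget: {budget:.2f} ms")        # 12.82 ms
print(f"Real-time factor: {rtf:.2f}x")             # 3.12x
print(f"First-clip latency: {FIRST_CLIP_LATENCY_MS:.0f} ms")
```

Under the assumed 25 FPS playback rate, generation would run roughly three times faster than playback, which is the kind of headroom an interactive system needs to absorb audio processing and network delay.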

Abstract

Diffusion-based models have been widely adopted for virtual human generation owing to their outstanding expressiveness. However, their substantial computational requirements have constrained their deployment in real-time interactive avatar applications, where stringent speed, latency, and duration requirements are paramount. To address these challenges, we present a novel audio-driven portrait video generation framework based on the diffusion model. First, we propose robust variable-length video generation to reduce the minimum time required to generate the initial video clip or a state transition, which significantly enhances the user experience. Second, we propose a consistency model training strategy for Audio-Image-to-Video to ensure real-time performance, enabling fast few-step generation. Model quantization and pipeline parallelism are further employed to accelerate the…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Music Technology and Sound Studies