Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching
Haiyang Liu, Xiaolin Hong, Xuancheng Yang, Yudi Ruan, Xiang Lian, Michael Lingelbach, Hongwei Yi, Wei Li

TL;DR
Livatar is a real-time system for generating talking head videos driven by audio, achieving high lip-sync accuracy and low latency, making high-fidelity avatars accessible for various applications.
Contribution
The paper introduces a flow-matching-based framework for real-time talking head generation that improves lip-sync accuracy and system efficiency.
Findings
Achieves a LipSync Confidence of 8.50 on the HDTF dataset
Reaches a throughput of 141 FPS with 0.17s end-to-end latency on a single A10 GPU
Reports lip-sync quality competitive with existing baselines while running in real time
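To put the reported throughput and latency in perspective, the short calculation below converts them into a per-frame time budget and a real-time factor. The 141 FPS and 0.17s figures come from the paper; the 25 FPS playback rate is an assumption made only for illustration. End-to-end latency covers more than frame generation, so the last quantity is a rough scale comparison and not a statement about how the system batches frames.

```python
# Figures reported in the paper.
THROUGHPUT_FPS = 141.0
LATENCY_S = 0.17

# Assumption for illustration only: the avatar video is played back at 25 FPS.
ASSUMED_PLAYBACK_FPS = 25.0

# Average generation time available per frame at the reported throughput.
per_frame_ms = 1000.0 / THROUGHPUT_FPS

# How many times faster than playback the generator runs (under the assumption).
realtime_factor = THROUGHPUT_FPS / ASSUMED_PLAYBACK_FPS

# Number of frames the generator could produce in one latency window.
frames_in_latency_window = THROUGHPUT_FPS * LATENCY_S

print(f"per-frame budget: {per_frame_ms:.2f} ms")
print(f"real-time factor at {ASSUMED_PLAYBACK_FPS:.0f} FPS playback: {realtime_factor:.2f}x")
print(f"frames per latency window: {frames_in_latency_window:.1f}")
```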
Abstract
We present Livatar, a real-time, audio-driven talking-head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow-matching-based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality, with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/ with examples at https://h-liu1997.github.io/Livatar-1/
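For readers unfamiliar with flow matching, here is a minimal sketch of the generic technique, assuming the common linear interpolation path between a noise sample and a data sample. It is not the tailored formulation used in Livatar, which the abstract does not detail. The function names, the two-dimensional toy target, and the closed-form `oracle_velocity` (a stand-in for a trained, audio-conditioned network) are all hypothetical.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching: a point on the path from x0 (noise)
    to x1 (data) at time t, and the velocity a model would regress to."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-in for a learned network: the exact velocity field that
# transports any starting point to a single fixed target along a straight line.
target = np.array([0.3, -1.2])

def oracle_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(2)
sample = euler_sample(oracle_velocity, noise, num_steps=4)
print(sample)  # close to `target`
```

In a real system the velocity function would be a neural network, and the number of integration steps is one of the main levers on speed, since each step costs one network evaluation.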
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Music Technology and Sound Studies
