Personalizing Causal Audio-Driven Facial Motion via Dynamic Multi-modal Retrieval
Xuangeng Chu, Yu Han, Wei Mao, Shih-En Wei

TL;DR
This paper introduces a real-time, personalized facial animation framework that uses dynamic multi-modal retrieval and hierarchical motion representation to improve lip-sync accuracy and realism.
Contribution
It presents a novel causal autoregressive architecture with multi-modal style retrieval and hierarchical motion encoding for scalable, personalized facial animation.
Findings
Outperforms state-of-the-art methods in lip-sync accuracy
Enhances identity consistency and realism
Enables low-latency, scalable personalization
Abstract
Audio-driven facial animation is essential for immersive digital interaction, yet existing frameworks fail to reconcile real-time streaming with high-fidelity personalization. Current methods often rely on latency-inducing audio look-ahead, or require high user compliance to pre-encode static embeddings that fail to capture dynamic idiosyncrasies. We present an end-to-end causal framework for personalizing causal facial motion generation via dynamic multi-modal style retrieval, enabling ultra-low latency while uniquely leveraging unstructured style references. We introduce two key innovations: (1) a temporal hierarchical motion representation that captures global temporal context and high-frequency details while maintaining decoding causality, and (2) a multi-modal style retriever that jointly queries audio and motion to dynamically extract stylistic priors without breaking causality.…
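The abstract does not give implementation details, so the following is only a toy sketch of what "jointly querying audio and motion without breaking causality" could look like. It assumes a soft-attention lookup over a small bank of reference frames; the function names (`retrieve_style`, `causal_generate`), the blending "decoder", and all feature dimensions are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_style(query, bank_keys, bank_values):
    """Soft attention lookup: weight each reference value by query-key similarity."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in bank_keys])
    dim = len(bank_values[0])
    return [sum(w * v[d] for w, v in zip(weights, bank_values)) for d in range(dim)]


def causal_generate(audio, bank_audio, bank_motion, motion_dim):
    """Generate motion frame by frame.

    Frame t is computed from audio[t] and the motion already generated for
    frames < t, so no future audio is consulted (no look-ahead).
    """
    # Assumption: each reference key is the concatenation of that reference
    # frame's audio and motion features.
    bank_keys = [a + m for a, m in zip(bank_audio, bank_motion)]
    prev = [0.0] * motion_dim  # neutral starting pose (assumption)
    out = []
    for a_t in audio:
        query = a_t + prev  # joint audio + past-motion query
        style = retrieve_style(query, bank_keys, bank_motion)
        # Toy stand-in for a decoder: blend the retrieved style with the past frame.
        prev = [0.5 * s + 0.5 * p for s, p in zip(style, prev)]
        out.append(prev)
    return out


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bank_audio = [[1.0, 0.0], [0.0, 1.0]]
    bank_motion = [[0.2, -0.1], [-0.3, 0.4]]
    for frame in causal_generate(audio, bank_audio, bank_motion, motion_dim=2):
        print(frame)
```

Because each query is built only from the current audio frame and previously generated motion, changing a later audio frame leaves all earlier output frames unchanged, which is the property that a streaming setting requires.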
