Personalizing Causal Audio-Driven Facial Motion via Dynamic Multi-modal Retrieval

Xuangeng Chu; Yu Han; Wei Mao; Shih-En Wei

arXiv:2604.23692·cs.GR·April 28, 2026

TL;DR

This paper introduces a real-time, personalized facial animation framework that uses dynamic multi-modal retrieval and hierarchical motion representation to improve lip-sync accuracy and realism.

Contribution

It presents a novel causal autoregressive architecture with multi-modal style retrieval and hierarchical motion encoding for scalable, personalized facial animation.

Findings

01 Outperforms state-of-the-art in lip-sync accuracy

02 Enhances identity consistency and realism

03 Enables low-latency, scalable personalization

Abstract

Audio-driven facial animation is essential for immersive digital interaction, yet existing frameworks fail to reconcile real-time streaming with high-fidelity personalization. Current methods often rely on latency-inducing audio look-ahead, or require high user compliance to pre-encode static embeddings that fail to capture dynamic idiosyncrasies. We present an end-to-end causal framework for personalizing facial motion generation via dynamic multi-modal style retrieval, enabling ultra-low latency while uniquely leveraging unstructured style references. We introduce two key innovations: (1) a temporal hierarchical motion representation that captures global temporal context and high-frequency details while maintaining decoding causality, and (2) a multi-modal style retriever that jointly queries audio and motion to dynamically extract stylistic priors without breaking causality.…
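To make the idea of causal, jointly queried style retrieval more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the excerpt does not specify the retriever or the generator, so everything below is an assumption made for illustration. In particular, the bank layout of `(audio_key, motion_key, style_value)` tuples, the scaled dot-product scoring, and the names `retrieve_style`, `toy_step`, and `generate` are hypothetical. The sketch only shows the property the abstract emphasizes: at step t the query uses the current audio frame and the previously generated motion frame, so no future audio is needed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_style(audio_t, motion_prev, bank, temperature=1.0):
    """Soft retrieval over a style reference bank (hypothetical layout).

    bank: list of (audio_key, motion_key, style_value) tuples.
    The query concatenates the current audio frame with the previous
    motion frame, so it depends only on past and present information.
    """
    query = list(audio_t) + list(motion_prev)
    scale = temperature * math.sqrt(len(query))
    scores = []
    for audio_key, motion_key, _ in bank:
        key = list(audio_key) + list(motion_key)
        scores.append(sum(q * k for q, k in zip(query, key)) / scale)
    weights = softmax(scores)
    dim = len(bank[0][2])
    style = [sum(w * entry[2][d] for w, entry in zip(weights, bank))
             for d in range(dim)]
    return style, weights


def toy_step(audio_t, motion_prev, style):
    """Placeholder for a learned generator step (assumption, not the paper's model).

    Assumes the style vector has the same dimension as the motion frame.
    """
    drive = 0.1 * sum(audio_t) / len(audio_t)
    return [0.5 * m + 0.5 * s + drive for m, s in zip(motion_prev, style)]


def generate(audio, bank, step_fn=toy_step):
    """Autoregressive loop: frame t sees audio[0..t] and motion[0..t-1] only."""
    motion_dim = len(bank[0][1])
    prev = [0.0] * motion_dim
    frames = []
    for audio_t in audio:
        style, _ = retrieve_style(audio_t, prev, bank)
        prev = step_fn(audio_t, prev, style)
        frames.append(prev)
    return frames
```

Because each step reads only the current audio frame and the last generated motion frame, changing a later audio frame cannot alter earlier output frames. A real system would replace `toy_step` with a learned decoder and the raw feature lists with learned embeddings.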
