Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models
Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

TL;DR
Ex-Omni introduces a framework that enhances large language models with speech-driven 3D facial animation, addressing representation challenges and enabling natural multimodal interactions.
Contribution
Ex-Omni decouples semantic reasoning from temporal generation, using speech units as temporal scaffolding and a token-as-query gated fusion (TQGF) mechanism to inject semantics into 3D facial animation generation in a controlled way.
Findings
Ex-Omni performs competitively with existing models.
The framework enables stable, aligned speech and facial animation generation.
The InstructEx dataset facilitates training and evaluation of speech-driven 3D facial animation.
Abstract
Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset…
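The abstract names the token-as-query gated fusion (TQGF) mechanism but does not specify it here. The following is a minimal illustrative sketch of one plausible reading, not the authors' implementation: temporal speech-unit features act as attention queries over LLM semantic states, and a learned sigmoid gate controls how much semantic context is added to each frame. The function name `gated_query_fusion`, the weight matrices `w_q`, `w_k`, `w_v`, `w_g`, and all shapes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_query_fusion(unit_feats, semantic_feats, w_q, w_k, w_v, w_g):
    """Hypothetical gated fusion with temporal tokens as queries.

    unit_feats:     (T, d) speech-unit features, one row per time step
    semantic_feats: (S, d) semantic hidden states from the language model
    w_q, w_k, w_v:  (d, d) projection matrices (assumed, not from the paper)
    w_g:            (2d, d) gate projection (assumed, not from the paper)
    Returns the fused (T, d) features and the (T, d) gate values.
    """
    q = unit_feats @ w_q          # temporal tokens become queries
    k = semantic_feats @ w_k      # semantic states become keys
    v = semantic_feats @ w_v      # ... and values

    # Scaled dot-product attention: each time step attends over semantics.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (T, S)
    context = attn @ v                                        # (T, d)

    # Sigmoid gate computed from the temporal feature and its context.
    gate_in = np.concatenate([unit_feats, context], axis=-1)  # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_g)))             # (T, d)

    # Residual injection: the temporal scaffold is kept, semantics are added.
    return unit_feats + gate * context, gate
```

In this sketch the output keeps the temporal length T of the speech units, so the semantic states only modulate each frame rather than dictate the timing, which is one way to read the "temporal scaffolding" idea in the abstract.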
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
