Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang; Zhipeng Li; Yiwen Guo; Tianshu Yu

arXiv:2602.07106 · cs.CV · February 10, 2026

TL;DR

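Ex-Omni is an open-source framework that augments omni-modal large language models (OLLMs) with speech-accompanied 3D facial animation. It targets the mismatch between the discrete, token-level reasoning of LLMs and the dense temporal dynamics of facial motion, with the goal of more natural multimodal interaction.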

Contribution

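The paper decouples semantic reasoning from temporal generation: speech units serve as temporal scaffolding, and a unified token-as-query gated fusion (TQGF) mechanism injects semantic information in a controlled way. It also introduces the InstructEx dataset.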

Findings

01. Ex-Omni performs competitively with existing models.

02. The framework generates stable, aligned speech and facial animation.

03. The InstructEx dataset supports training and evaluation of speech-driven 3D facial animation.

Abstract

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset…
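The abstract names a token-as-query gated fusion (TQGF) mechanism but this page does not describe its internals. As a rough illustration only, the sketch below shows one generic way such a fusion could work: each temporal token acts as a query over a set of semantic states via scaled dot-product attention, and a sigmoid gate controls how much of the attended context is added back to the token. The function name `gated_fusion`, the scalar per-token gate, the residual form, and all parameter shapes are assumptions made for this example, not details taken from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_fusion(queries, memory, gate_w, gate_b):
    """Hypothetical gated fusion (an assumption, not the paper's TQGF).

    queries: T temporal tokens, each a list of d floats (used as queries).
    memory:  N semantic states, each a list of d floats (keys and values).
    gate_w:  2*d gate weights applied to [query ; context].
    gate_b:  scalar gate bias.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Each token attends over the semantic states.
        weights = softmax([dot(q, m) / math.sqrt(d) for m in memory])
        context = [sum(w * m[j] for w, m in zip(weights, memory))
                   for j in range(d)]
        # A scalar sigmoid gate decides how much context is injected.
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, q + context) + gate_b)))
        # Residual update: the token keeps its own content.
        fused.append([qj + gate * cj for qj, cj in zip(q, context)])
    return fused

# Toy inputs: 3 temporal tokens, 2 semantic states, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = [[0.5, -0.5], [2.0, 1.0]]
fused = gated_fusion(queries, memory, gate_w=[0.1] * 4, gate_b=0.0)
```

In this sketch the output keeps one vector per temporal token, so the sequence length is set by the temporal side rather than the semantic side. A strongly negative gate bias leaves the tokens essentially unchanged, which is one simple reading of "controlled semantic injection".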

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis