OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao

TL;DR
OmniHuman-1.5 introduces a novel framework that generates semantically coherent and expressive avatar animations by integrating multimodal large language models and a specialized multimodal fusion architecture.
Contribution
The paper presents a new approach combining multimodal large language models with a Pseudo Last Frame design for improved semantic understanding in avatar motion generation.
Findings
Achieves state-of-the-art lip-sync accuracy and motion naturalness.
Demonstrates strong semantic consistency with textual prompts.
Extends effectively to multi-person and non-human scenarios.
Abstract
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally…
