UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
Xiaoyu Zhan, Xinyu Fu, Chenghao Yang, Xiaohong Zhang, Dongjie Fu, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Hansung Kim, Yuanqi Li, Jie Guo, Yanwen Guo

TL;DR
UMo introduces a unified sparse motion modeling framework that enables real-time, high-fidelity co-speech avatar animations by processing text, audio, and motion within a single architecture.
Contribution
The paper proposes UMo, a novel architecture combining sparse modeling and a Mixture-of-Experts framework for efficient, high-quality, real-time co-speech avatar animation.
Findings
UMo achieves superior animation quality under low latency.
It maintains fine-grained speech-motion alignment in real-time.
UMo demonstrates effective facial and gesture animation generation.
Abstract
Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods either rely on a single modality for audio-motion alignment, and so fail to exploit the large body of available human motion data, or are constrained by the representational capacity and throughput of multimodal models, which makes it difficult to achieve both high-quality motion generation and real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for…
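The abstract names two kinds of sparsity without giving details: routing each token to only a few experts, and generating keyframes that are then reconstructed into dense motion. The sketch below illustrates both ideas in generic form. It is not UMo's implementation: the function names (`top_k_gate`, `moe_forward`, `densify`), the softmax-over-top-k gating, and the linear interpolation between keyframes are all assumptions chosen for simplicity, and the paper's actual router and reconstruction module may differ.

```python
import math


def top_k_gate(scores, k):
    """Keep the k largest router scores and softmax over those only.

    Hypothetical gating rule; returns {expert_index: weight}.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_scores, k=2):
    """Evaluate only the k selected experts and mix their outputs."""
    weights = top_k_gate(router_scores, k)
    out = [0.0] * len(x)
    for i, w in weights.items():
        y = experts[i](x)  # experts outside the top k are never called
        out = [o + w * v for o, v in zip(out, y)]
    return out


def densify(keyframes, num_frames):
    """Expand sparse (frame_index, pose) keyframes to one pose per frame.

    Assumes unique frame indices; uses linear interpolation as a stand-in
    for a learned dense reconstruction, and clamps outside the keyframe range.
    """
    keyframes = sorted(keyframes)
    dense = []
    for t in range(num_frames):
        if t <= keyframes[0][0]:
            dense.append(list(keyframes[0][1]))
            continue
        if t >= keyframes[-1][0]:
            dense.append(list(keyframes[-1][1]))
            continue
        for (t0, p0), (t1, p1) in zip(keyframes, keyframes[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0)
                dense.append([(1 - a) * u + a * v for u, v in zip(p0, p1)])
                break
    return dense
```

With `k` much smaller than the number of experts, the per-token cost depends on `k` rather than on the total expert count, and predicting only keyframes shortens the sequence the model has to generate. Those are the general reasons such designs suit low-latency settings, independent of how UMo realizes them.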
