UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors
Xiaoyan Cong, Zekun Li, Zhiyang Dou, Hongyu Li, Omid Taheri, Chuan Guo, Abhay Mittal, Sizhe An, Taku Komura, Wojciech Matusik, Michael J. Black, Srinath Sridhar

TL;DR
UMO is a unified framework that leverages pretrained motion foundation models to support diverse motion generation tasks through in-context learning, enabling flexible and efficient adaptation without extensive retraining.
Contribution
The paper proposes a novel unified formulation with learnable frame-level meta-operations that adapt pretrained models for multiple motion tasks within a single framework.
Findings
UMO outperforms task-specific baselines across benchmarks.
Supports diverse tasks such as motion editing, inpainting, and multi-identity generation.
Achieves in-context adaptation with negligible runtime overhead.
Abstract
Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., models trained only for text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications
