UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

Xiaoyan Cong; Zekun Li; Zhiyang Dou; Hongyu Li; Omid Taheri; Chuan Guo; Abhay Mittal; Sizhe An; Taku Komura; Wojciech Matusik; Michael J. Black; Srinath Sridhar

arXiv:2603.15975 · cs.CV · March 18, 2026

TL;DR

UMO introduces a unified framework that leverages pretrained motion foundation models to support diverse motion generation tasks through in-context learning, enabling flexible and efficient adaptation without extensive retraining.

Contribution

The paper proposes a novel unified formulation with learnable frame-level meta-operations that adapt pretrained models for multiple motion tasks within a single framework.
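
As a rough illustration of what "frame-level meta-operations" could look like, the sketch below describes each task as a per-frame plan of operations. The operation names (`KEEP`, `GENERATE`, `EDIT`) and the helper functions are hypothetical and chosen only for illustration; this summary does not list the paper's actual operation set, and in the paper the operations are learnable rather than fixed labels.

```python
from enum import Enum
from typing import List


class FrameOp(Enum):
    """Hypothetical per-frame operations (not the paper's actual set)."""
    KEEP = "keep"          # frame is given as context and left unchanged
    GENERATE = "generate"  # frame is synthesized by the model
    EDIT = "edit"          # frame is given but may be modified


def text_to_motion_plan(num_frames: int) -> List[FrameOp]:
    """Plain text-to-motion: every frame is generated from scratch."""
    return [FrameOp.GENERATE] * num_frames


def inpainting_plan(num_frames: int, gap_start: int, gap_end: int) -> List[FrameOp]:
    """Inpainting: generate frames in [gap_start, gap_end), keep the rest."""
    return [
        FrameOp.GENERATE if gap_start <= t < gap_end else FrameOp.KEEP
        for t in range(num_frames)
    ]


def editing_plan(num_frames: int) -> List[FrameOp]:
    """Editing: every source frame is available and may be modified."""
    return [FrameOp.EDIT] * num_frames


if __name__ == "__main__":
    plan = inpainting_plan(num_frames=6, gap_start=2, gap_end=4)
    print([op.value for op in plan])
```

Under this assumed encoding, different tasks differ only in their per-frame plan, which is one way to read the claim that a single framework can cover several tasks.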

Findings

01. UMO outperforms task-specific baselines across benchmarks.

02. UMO supports diverse tasks such as motion editing, inpainting, and multi-identity generation.

03. UMO achieves in-context adaptation with negligible runtime overhead.

Abstract

Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications