A Unified Framework for Motion Reasoning and Generation in Human Interaction
Jeongeun Park, Sungjoon Choi, Sangdoo Yun

TL;DR
This paper introduces VIM, a unified model that integrates language and motion understanding to generate and control interactive human motions in multi-turn conversations, supported by Inter-MT2, a large-scale dataset.
Contribution
The paper presents VIM, a novel unified architecture for simultaneous motion and language processing, and introduces Inter-MT2, a large-scale dataset for interactive motion instruction tuning.
Findings
VIM effectively handles multiple interactive motion tasks.
Inter-MT2 enables training of versatile motion-language models.
VIM demonstrates strong performance across diverse motion understanding and generation tasks.
Abstract
Recent advancements in large language models (LLMs) have significantly improved their ability to generate natural and contextually relevant text, enabling more human-like AI interactions. However, generating and understanding interactive human-like motion, where multiple individuals engage in coordinated movements, remains challenging due to the complexity of modeling these interactions. Additionally, a unified and versatile model is needed to handle diverse interactive scenarios, such as chat systems that dynamically adapt to user instructions and assigned roles. To address these challenges, we introduce VIM, the Versatile Interactive Motion-language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. Unlike previous studies that primarily focus on uni-directional tasks…
Taxonomy
Topics: Human Motion and Animation
Methods: ALIGN
