MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description
Jiawei Mo, Yixuan Chen, Rifen Lin, Yongkang Ni, Min Zeng, Xiping Hu, Min Li

TL;DR
MoChat is a novel multimodal large language model that enables fine-grained spatio-temporal grounding of human motion and supports multi-turn dialogue understanding, advancing motion comprehension capabilities.
Contribution
The paper introduces MoChat, the first model capable of joint spatio-temporal grounding of human motion and multi-turn dialogue understanding, with a new joints-grouped skeleton encoder.
Findings
Achieves state-of-the-art performance in motion understanding tasks.
Effectively captures fine-grained spatio-temporal details.
Supports multi-turn dialogue for motion description.
Abstract
Despite continuous advances in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, and typically support only single-round interaction. Such limitations in capturing fine-grained motion details reduce their effectiveness in motion understanding tasks. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and understanding multi-turn dialogue context. To achieve these capabilities, we group the spatial information of each skeleton frame based on human anatomical structure and then encode the groups with a Joints-Grouped Skeleton Encoder, whose outputs are combined with LLM embeddings to create spatio-aware and temporal-aware embeddings separately. Additionally, we develop a pipeline for extracting timestamps from skeleton sequences based on…
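The grouping step mentioned in the abstract (splitting each skeleton frame by anatomical structure) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 22-joint SMPL-style index layout, the five part names in `BODY_PARTS`, and the helpers `group_frame` and `part_centroids`. The centroid is only a simple stand-in for a learned per-part feature, not the Joints-Grouped Skeleton Encoder itself.

```python
# Hypothetical joint layout: 22 joints in SMPL-style order.
# This mapping is an assumption for illustration, not the paper's grouping.
BODY_PARTS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

NUM_JOINTS = 22


def group_frame(frame):
    """Split one skeleton frame into body-part groups.

    `frame` is a list of NUM_JOINTS (x, y, z) tuples, indexed by joint id.
    Returns a dict mapping part name to the list of that part's joints.
    """
    if len(frame) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(frame)}")
    return {part: [frame[j] for j in idxs] for part, idxs in BODY_PARTS.items()}


def part_centroids(frame):
    """Mean (x, y, z) position per body part.

    A simple stand-in for a per-part feature; a real encoder would
    learn this representation instead of averaging.
    """
    grouped = group_frame(frame)
    return {
        part: tuple(sum(axis) / len(joints) for axis in zip(*joints))
        for part, joints in grouped.items()
    }


if __name__ == "__main__":
    # Toy frame: joint j sits at x = j, on the ground plane.
    toy_frame = [(float(j), 0.0, 0.0) for j in range(NUM_JOINTS)]
    for part, centroid in part_centroids(toy_frame).items():
        print(part, centroid)
```

Grouping before encoding keeps each body part's joints together, so a downstream model could in principle attend to one limb at a time when a question refers to a specific body part.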
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hand Gesture Recognition Systems
