MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description

Jiawei Mo; Yixuan Chen; Rifen Lin; Yongkang Ni; Min Zeng; Xiping Hu; Min Li

arXiv:2410.11404 · cs.CV · October 16, 2024

TL;DR

MoChat is a novel multimodal large language model that enables fine-grained spatio-temporal grounding of human motion and supports multi-turn dialogue understanding, advancing motion comprehension capabilities.

Contribution

The paper introduces MoChat, the first model capable of joint spatio-temporal grounding of human motion and multi-turn dialogue understanding, with a new joints-grouped skeleton encoder.

Findings

01 Achieves state-of-the-art performance in motion understanding tasks.

02 Effectively captures fine-grained spatio-temporal details.

03 Supports multi-turn dialogue for motion description.

Abstract

Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, and typically support only single-round interaction. Such limitations in capturing fine-grained motion details reduce their effectiveness in motion understanding tasks. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and understanding multi-turn dialogue context. To achieve these capabilities, we group the spatial information of each skeleton frame based on human anatomical structure and then pass the groups through a Joints-Grouped Skeleton Encoder, whose outputs are combined with LLM embeddings to create spatio-aware and temporal-aware embeddings separately. Additionally, we develop a pipeline for extracting timestamps from skeleton sequences based on…
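The grouping step mentioned in the abstract can be illustrated with a minimal sketch. It only shows the general idea of splitting one skeleton frame into anatomical groups before encoding. The 22-joint layout, the group names, the index assignments, and the functions `group_frame` and `group_sequence` are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) position of one joint

# ASSUMPTION: a 22-joint skeleton with a hypothetical index layout.
# These groups and indices are illustrative, not taken from the paper.
NUM_JOINTS = 22
BODY_PART_GROUPS: Dict[str, List[int]] = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}


def group_frame(frame: Sequence[Joint]) -> Dict[str, List[Joint]]:
    """Split one skeleton frame into per-body-part lists of joints."""
    if len(frame) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(frame)}")
    return {
        part: [frame[i] for i in indices]
        for part, indices in BODY_PART_GROUPS.items()
    }


def group_sequence(frames: Sequence[Sequence[Joint]]) -> List[Dict[str, List[Joint]]]:
    """Apply the same grouping to every frame of a motion sequence."""
    return [group_frame(frame) for frame in frames]


if __name__ == "__main__":
    # Dummy frame: joint i sits at (i, 0, 0) so groups are easy to inspect.
    dummy_frame = [(float(i), 0.0, 0.0) for i in range(NUM_JOINTS)]
    for part, joints in group_frame(dummy_frame).items():
        print(part, len(joints))
```

In a setup like this, each group would be passed to an encoder so that downstream embeddings can refer to a specific body part; how MoChat does this in detail is described in the paper, not here.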

Code & Models

Models

🤗 CSUBioGroup/MoChat (model)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Hand Gesture Recognition Systems