MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Ling-Hao Chen; Shunlin Lu; Ailing Zeng; Hao Zhang; Benyou Wang; Ruimao Zhang; Lei Zhang

arXiv:2405.20340 · cs.CV · May 31, 2024 · 5 cites

TL;DR

MotionLLM introduces a unified framework that combines video and motion data for improved human behavior understanding, captioning, and reasoning, supported by a new dataset and benchmark for comprehensive evaluation.

Contribution

It presents a novel joint video-motion modeling approach, a large multi-modal dataset MoVid, and a benchmark MoVid-Bench for enhanced human behavior analysis.

Findings

01. MotionLLM outperforms existing models in captioning and reasoning tasks.
02. The combined use of video and motion data improves understanding accuracy.
03. Extensive experiments validate the effectiveness of the proposed framework.

Abstract

This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising…

Code & Models

Repositories

IDEA-Research/MotionLLM
PyTorch · Official

Models

EvanTHU/MotionLLM-7B
model · ♡ 2

Datasets

EvanTHU/MoVid
dataset · 37 downloads

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods