KMM: Key Frame Mask Mamba for Extended Motion Generation
Zeyu Zhang, Hang Gao, Akide Liu, Qi Chen, Feng Chen, Yiran Wang, Danning Li, Rui Zhao, Zhenming Li, Zhongwen Zhou, Hao Tang, Bohan Zhuang

TL;DR
This paper introduces KMM, a novel architecture that enhances long and complex human motion generation by focusing on key frames, improving multimodal fusion, and achieving state-of-the-art results on the BABEL dataset.
Contribution
The paper proposes KMM with key frame masking, a contrastive learning paradigm for better multimodal fusion, and demonstrates superior performance on human motion generation tasks.
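The contrastive learning paradigm named above is not specified in this summary. As a rough illustration of the general idea only, the sketch below implements a generic symmetric InfoNCE loss that pulls paired text and motion embeddings together and pushes mismatched pairs apart. The function names (`cosine`, `info_nce`), the temperature value, and the toy embeddings are all assumptions made for illustration, not the authors' formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(text_emb, motion_emb, temperature=0.07):
    """Generic symmetric InfoNCE over a batch of paired embeddings.

    text_emb[i] and motion_emb[i] are assumed to describe the same sample.
    The temperature is an illustrative default, not a value from the paper.
    """
    n = len(text_emb)
    # logits[i][j]: temperature-scaled similarity of text i and motion j
    logits = [[cosine(t, m) / temperature for m in motion_emb] for t in text_emb]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)  # subtract the max for numerical stability
            log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += log_z - row[i]
        return total / n

    text_to_motion = cross_entropy_diag(logits)
    motion_to_text = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (text_to_motion + motion_to_text)

# Toy embeddings (hypothetical): correctly paired vs. deliberately swapped.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = aligned[::-1]
print(info_nce(text, aligned) < info_nce(text, swapped))
```

With correctly paired embeddings the loss is close to zero, while swapping the motion embeddings makes it large. A loss with this behaviour is what penalises a model for mismatches such as confusing "left" with "right".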
Findings
Achieved over 57% reduction in FID score.
Reduced model parameters by 70% compared to previous methods.
Enhanced focus on key actions in motion segments.
Abstract
Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain: Firstly, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of the implicit memory leads to memory decay. Secondly, Mamba struggles with multimodal fusion compared to Transformers, and lacks alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions: Firstly, we introduce KMM, a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments.…
Taxonomy
Topics: Robotic Mechanisms and Dynamics · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
Methods: Contrastive Learning · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Focus
