HMVLM: Human Motion-Vision-Language Model via MoE LoRA
Lei Hu, Yongjing Ye, Shihong Xia

TL;DR
This paper introduces HMVLM, a unified human motion-vision-language model that uses MoE LoRA to enhance multimodal understanding, address catastrophic forgetting, and improve pose representation for diverse downstream tasks.
Contribution
The paper proposes a novel MoE LoRA-based framework with a zero expert and body-part-specific tokenization to improve multimodal learning and mitigate forgetting in human motion-language models.
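To make the "zero expert" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the dense softmax gate, the absence of a LoRA scaling factor, and all names (`moe_lora_forward`, `gate_W`, `experts`) are assumptions chosen for illustration. The frozen base weight `W` is combined with gated low-rank updates `B @ A`, and one extra gate slot is reserved for an expert that contributes nothing, so routing to it leaves the pretrained output unchanged.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def moe_lora_forward(x, W, experts, gate_W):
    """Hypothetical MoE LoRA layer with a zero expert.

    x       : input vector of length d_in
    W       : frozen base weight, d_out x d_in
    experts : list of (A, B) LoRA pairs, A is r x d_in, B is d_out x r
    gate_W  : gating weight with len(experts) + 1 rows; the LAST row
              scores the zero expert (an assumption of this sketch)
    """
    base = matvec(W, x)
    gates = softmax(matvec(gate_W, x))
    out = list(base)
    # zip stops after the real experts, so the zero expert's gate mass
    # is simply never applied to any update.
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))
        out = [o + g * d for o, d in zip(out, delta)]
    return out, gates

# Usage: one rank-1 expert plus the zero expert on a 2-d input.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[1.0, 1.0]], [[1.0], [1.0]])]
gate_W = [[0.0, 0.0], [0.0, 0.0]]  # uniform gate: half the mass goes to the zero expert
out, gates = moe_lora_forward(x, W, experts, gate_W)
```

When the gate puts nearly all its mass on the zero expert, the layer output approaches `W @ x`, which is one plausible reading of how such an expert could help limit forgetting.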
Findings
Effectively alleviates catastrophic forgetting during instruction-tuning.
Achieves state-of-the-art performance across multiple human motion tasks.
Enhances pose representation with body-part-specific tokenization.
Abstract
The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Experts Low-Rank Adaptation (MoE LoRA) strategy. The framework leverages the gating network to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition
