HMVLM: Human Motion-Vision-Language Model via MoE LoRA

Lei Hu; Yongjing Ye; Shihong Xia

arXiv:2511.01463 · cs.CV · November 4, 2025

Open Access

TL;DR

This paper introduces HMVLM, a unified human motion-vision-language model that uses MoE LoRA to enhance multimodal understanding, address catastrophic forgetting, and improve pose representation for diverse downstream tasks.

Contribution

The paper proposes a novel MoE LoRA-based framework with a zero expert and body-part-specific tokenization to improve multimodal learning and mitigate forgetting in human motion-language models.
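
To make the idea concrete, here is a minimal pure-Python sketch of a MoE LoRA layer with a zero expert. It is not the authors' implementation. It assumes that each LoRA expert adds a low-rank update `B(Ax)` to the frozen base output `Wx`, that a softmax gate weights the experts, and that the zero expert contributes no update at all, so routing to it leaves the pre-trained behaviour untouched. The names `moe_lora_forward`, `W_gate`, and `experts` are hypothetical.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_lora_forward(x, W, experts, W_gate):
    """Frozen base output plus gated low-rank updates.

    experts: list of (A, B) LoRA pairs (A is r x d_in, B is d_out x r).
    W_gate:  one row of gate weights per expert, plus one final row
             for the zero expert (assumption: it adds nothing).
    """
    out = matvec(W, x)                   # frozen pre-trained path
    gates = softmax(matvec(W_gate, x))   # len(experts) + 1 weights
    # zip stops after the real experts, so the last gate weight
    # (the zero expert) scales an implicit all-zero update.
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
        out = [o + g * d for o, d in zip(out, delta)]
    return out, gates

# Toy example: identity base weights, one rank-1 expert, one zero expert.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [([[1.0, 0.0]], [[1.0], [1.0]])]

# Uniform gate: half the weight goes to the LoRA expert.
y_mixed, g_mixed = moe_lora_forward(x, W, experts, [[0.0, 0.0], [0.0, 0.0]])

# Gate dominated by the zero expert: output falls back to W x.
y_zero, g_zero = moe_lora_forward(x, W, experts, [[-50.0, 0.0], [50.0, 0.0]])
```

Under these assumptions, the more gate weight the zero expert receives, the closer the layer stays to the frozen model, which is one plausible way such an expert could limit forgetting.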

Findings

01 Effectively alleviates catastrophic forgetting during instruction-tuning.

02 Achieves state-of-the-art performance across multiple human motion tasks.

03 Enhances pose representation with body-part-specific tokenization.

Abstract

The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Experts Low-Rank Adaptation (MoE LoRA) strategy. The framework leverages the gating network to…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition