ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video

Rajan Das Gupta; Lei Wei; Md Yeasin Rahat; Nafiz Fahad; Abir Ahmed; Liew Tze Hui

arXiv:2508.09818·cs.CV·January 8, 2026

TL;DR

ViMoNet is a multimodal vision-language framework that effectively combines motion and video data to improve human behavior understanding, with applications in healthcare and behavior analysis.

Contribution

The paper introduces ViMoNet, a novel multimodal framework trained with a two-stage alignment and instruction-tuning strategy, along with a new dataset and benchmark for behavior understanding.

Findings

01. ViMoNet outperforms existing methods in captioning and behavior interpretation tasks.

02. The framework demonstrates effectiveness in healthcare applications like fall detection.

03. The VIMOS dataset and ViMoNet-Bench benchmark facilitate standardized evaluation.

Abstract

This study investigates the use of large language models (LLMs) for human behavior understanding by jointly leveraging motion and video data. We argue that integrating these complementary modalities is essential for capturing both fine-grained motion dynamics and contextual semantics of human actions, addressing the limitations of prior motion-only or video-only approaches. To this end, we propose ViMoNet, a multimodal vision-language framework trained through a two-stage alignment and instruction-tuning strategy that combines precise motion-text supervision with large-scale video-text data. We further introduce VIMOS, a multimodal dataset comprising human motion sequences, videos, and instruction-level annotations, along with ViMoNet-Bench, a standardized benchmark for evaluating behavior-centric reasoning. Experimental results demonstrate that ViMoNet consistently outperforms existing…
