ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video
Rajan Das Gupta, Lei Wei, Md Yeasin Rahat, Nafiz Fahad, Abir Ahmed, Liew Tze Hui

TL;DR
ViMoNet is a multimodal vision-language framework that effectively combines motion and video data to improve human behavior understanding, with applications in healthcare and behavior analysis.
Contribution
The paper introduces ViMoNet, a multimodal framework trained with a two-stage alignment and instruction-tuning strategy, along with the VIMOS dataset and the ViMoNet-Bench benchmark for behavior understanding.
Findings
ViMoNet outperforms existing methods in captioning and behavior interpretation tasks.
The framework demonstrates effectiveness in healthcare applications like fall detection.
The VIMOS dataset and ViMoNet-Bench benchmark facilitate standardized evaluation.
Abstract
This study investigates the use of large language models (LLMs) for human behavior understanding by jointly leveraging motion and video data. We argue that integrating these complementary modalities is essential for capturing both fine-grained motion dynamics and contextual semantics of human actions, addressing the limitations of prior motion-only or video-only approaches. To this end, we propose ViMoNet, a multimodal vision-language framework trained through a two-stage alignment and instruction-tuning strategy that combines precise motion-text supervision with large-scale video-text data. We further introduce VIMOS, a multimodal dataset comprising human motion sequences, videos, and instruction-level annotations, along with ViMoNet-Bench, a standardized benchmark for evaluating behavior-centric reasoning. Experimental results demonstrate that ViMoNet consistently outperforms existing…
