A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang, Zeyu Liu, Lilin Xu, Yang Li, Yuting He, Liran Dong, Wenrui Lu, Zhenyu Yan, Xiaofan Jiang, Wei Gao, Hongkai Chen, Guoliang Xing

TL;DR
This paper introduces CUHK-X, a large-scale multimodal dataset with benchmarks for human activity recognition, understanding, and reasoning, addressing the lack of fine-grained, multimodal data for advancing LLMs and LVLMs in complex activity analysis.
Contribution
The paper presents CUHK-X, a new multimodal dataset with scene creation methods and benchmarks for human action understanding (HAU) and human action reasoning (HARn), enabling improved LLM and LVLM applications in activity understanding.
Findings
Achieved 76.52% accuracy in HAR
Achieved 40.76% accuracy in HAU
Achieved 70.25% accuracy in HARn
Abstract
Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a…
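To make the two ground-truth pair types concrete, here is a minimal Python sketch. It is illustrative only: the class names, field names, modality strings, and the example activity are assumptions made here, not the dataset's actual schema. The `naive_caption` helper shows the label-templating baseline the abstract cautions against, where the caption carries no information beyond the label itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LabelPair:
    """Data-label pair: a sensor sample plus a discrete category."""
    modality: str            # hypothetical naming, e.g. "depth", "imu", "mmwave"
    sample: Sequence[float]  # placeholder for the raw sensor data
    label: str               # discrete activity category


@dataclass
class CaptionPair:
    """Data-caption pair: a sensor sample plus a textual description."""
    modality: str
    sample: Sequence[float]
    caption: str             # free-text description of the activity


def naive_caption(pair: LabelPair) -> CaptionPair:
    """Template a caption directly from the label (naive baseline).

    The result adds no temporal or spatial detail beyond the label.
    """
    return CaptionPair(pair.modality, pair.sample, f"A person is {pair.label}.")


if __name__ == "__main__":
    # Hypothetical sample; values and label are made up for illustration.
    pair = LabelPair("imu", [0.1, 0.2, 0.3], "drinking water")
    print(naive_caption(pair).caption)
```

A templated caption like this says nothing about ordering, duration, or surrounding context, which is the kind of fine-grained detail the abstract argues HAU and HARn require.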
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)
