A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

Siyang Jiang; Mu Yuan; Xiang Ji; Bufang Yang; Zeyu Liu; Lilin Xu; Yang Li; Yuting He; Liran Dong; Wenrui Lu; Zhenyu Yan; Xiaofan Jiang; Wei Gao; Hongkai Chen; Guoliang Xing

arXiv:2512.07136·cs.CV·December 9, 2025

Open Access

TL;DR

This paper introduces CUHK-X, a large-scale multimodal dataset with benchmarks for human activity recognition, understanding, and reasoning, addressing the lack of fine-grained, multimodal data for advancing LLMs and LVLMs in complex activity analysis.

Contribution

The paper presents CUHK-X, a new multimodal dataset with scene creation methods and benchmarks for human action understanding (HAU) and human action reasoning (HARn), enabling improved LLM and LVLM applications in activity understanding.

Findings

01. Achieved 76.52% accuracy in HAR

02. Achieved 40.76% accuracy in HAU

03. Achieved 70.25% accuracy in HARn

Abstract

Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating two new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision-language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave because large-scale data-caption resources are lacking. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture the fine-grained action dynamics needed for HAU and HARn. We consider two types of ground-truth pairs: (1) data-label (a discrete category) and (2) data-caption (a textual description). Captions generated naively from labels often lack logical and spatiotemporal consistency. We introduce CUHK-X, a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)