Towards Fine-Grained Human Motion Video Captioning
Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang

TL;DR
This paper introduces a motion-aware video captioning model that describes human actions in videos more accurately and in finer detail by leveraging human motion representations, and presents a new dataset of human-movement videos.
Contribution
The work presents the Motion-Augmented Caption Model (M-ACM), a new generative framework that explicitly incorporates human motion information to enhance caption quality.
Findings
M-ACM outperforms previous methods in describing complex human motions.
The HMI dataset contains 115K video-description pairs focused on human movement.
Experimental results show improved semantic fidelity and spatial alignment in captions.
Abstract
Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video…
