Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models
Dewen Zhang, Wangpeng An, Hayaru Shouno

TL;DR
This paper introduces a novel method for generating human pose and action instruction-following data by integrating keypoints with visual features, significantly improving multimodal model performance on human-centric tasks.
Contribution
It presents a new data generation approach combining keypoints with visual features and establishes a benchmark for human pose and action understanding in multimodal models.
Findings
Achieved a 21.18% performance improvement on the HPAUB benchmark.
Created a dataset with 200,328 samples for fine-tuning.
Enhanced model understanding of human poses and actions.
Abstract
Current vision-language multimodal models are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish a benchmark called Human Pose and Action Understanding Benchmark (HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and…
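The data generation idea in the abstract (keypoints combined with captions and bounding boxes, targeting three sample types) can be illustrated with a minimal sketch. It only shows how annotations might be serialized into a text prompt for a language model that then writes the instruction-following sample. Everything concrete here is an assumption rather than the authors' pipeline: the COCO-style 17-keypoint layout with flat `[x, y, v]` triplets, the `[x, y, w, h]` box format, the prompt wording, and the hypothetical helpers `format_keypoints`, `build_context`, and `build_prompt`.

```python
from typing import Dict, List, Sequence

# COCO-style 17-keypoint order. This is an assumption: the summary above
# does not say which annotation format the authors use.
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# The three sample types named in the abstract.
TASK_TYPES = ("conversation", "detailed description", "complex reasoning")


def format_keypoints(flat: Sequence[float]) -> str:
    """Render a flat [x, y, v, ...] list as 'name: (x, y)' text.

    Points with visibility flag v == 0 (not labeled) are skipped.
    """
    parts = []
    for i, name in enumerate(KEYPOINT_NAMES):
        x, y, v = flat[3 * i: 3 * i + 3]
        if v > 0:
            parts.append(f"{name}: ({x:.0f}, {y:.0f})")
    return ", ".join(parts)


def build_context(captions: List[str], people: List[Dict]) -> str:
    """Combine captions, boxes, and keypoints into one text context."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    for idx, person in enumerate(people, start=1):
        x, y, w, h = person["bbox"]  # assumed [x, y, width, height]
        lines.append(f"Person {idx} box: [{x:.0f}, {y:.0f}, {w:.0f}, {h:.0f}]")
        lines.append(
            f"Person {idx} keypoints: {format_keypoints(person['keypoints'])}"
        )
    return "\n".join(lines)


def build_prompt(task: str, captions: List[str], people: List[Dict]) -> str:
    """Prefix the context with a (hypothetical) task instruction."""
    if task not in TASK_TYPES:
        raise ValueError(f"unknown task type: {task!r}")
    header = f"Task: write a {task} sample about the people in this image."
    return header + "\n" + build_context(captions, people)


if __name__ == "__main__":
    # Made-up annotation for illustration only; only the nose is labeled.
    example_person = {
        "bbox": [10, 20, 30, 40],
        "keypoints": [100, 50, 2] + [0, 0, 0] * 16,
    }
    print(build_prompt("conversation", ["A man jumps."], [example_person]))
```

Serializing keypoints as named coordinates, rather than as raw numbers, is one plausible way to let a text-only model reason about limb positions; the actual prompt design used by the authors is not described in this summary.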
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition
