HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

Xiao Wang; Jingyun Hua; Weihong Lin; Yuanxing Zhang; Fuzheng Zhang; Jianlong Wu; Di Zhang; Liqiang Nie

arXiv:2502.20811·cs.CV·June 10, 2025

TL;DR

This paper introduces HAIC, a new dataset and annotation pipeline for human action videos, significantly improving multi-modal large language models' understanding and generation capabilities related to human actions.

Contribution

The paper presents a novel two-stage annotation pipeline and curated datasets, HAICTrain and HAICBench, to enhance video understanding and generation of human actions in multi-modal models.

Findings

1. Training with HAICTrain improves performance across 4 benchmarks.

2. HAIC datasets enhance human action understanding.

3. Improved text-to-video generation results.

Abstract

Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action…
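The abstract describes the caption format only in words: individuals are told apart by their attributes, and their actions and interactions are listed in time order. The following is a minimal Python sketch of that idea, not the authors' actual schema. The class names, field names, and sentence template (`Person`, `ActionEvent`, `render_caption`, `start_s`) are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    """One individual, identified by visible attributes (hypothetical schema)."""
    attributes: str  # e.g. "woman in a red jacket"


@dataclass
class ActionEvent:
    """One action or interaction (hypothetical schema)."""
    start_s: float                # when the action starts, in seconds
    actor: int                    # index into the list of people
    action: str                   # verb phrase describing the action
    target: Optional[int] = None  # index of a second person, for interactions


def render_caption(people: List[Person], events: List[ActionEvent]) -> str:
    """Refer to people by attributes and describe events in chronological order."""
    sentences = []
    for event in sorted(events, key=lambda e: e.start_s):
        sentence = f"The {people[event.actor].attributes} {event.action}"
        if event.target is not None:
            sentence += f" the {people[event.target].attributes}"
        sentences.append(sentence + ".")
    return " ".join(sentences)


people = [Person("woman in a red jacket"), Person("man in a gray hoodie")]
events = [
    ActionEvent(start_s=3.0, actor=1, action="catches the ball"),
    ActionEvent(start_s=0.5, actor=0, action="throws a ball to", target=1),
]
caption = render_caption(people, events)
print(caption)
```

Sorting by start time keeps the caption chronological even when events are recorded out of order, and referring to each person by attributes avoids ambiguous pronouns when several people appear in one clip.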

Code & Models

Datasets

KuaishouHAIC/HAIC
dataset· 61 dl

Videos

HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models· underline