HieroAction: Hierarchically Guided VLM for Fine-Grained Action Analysis

Junhao Wu; Xiuer Gu; Zhiying Li; Yeying Jin; Yunfeng Diao; Zhiyu Li; Zhenbo Song; Xiaomei Zhang; Zhaoxin Fan

arXiv:2508.16942 · cs.CV · August 26, 2025

TL;DR

HieroAction is a hierarchical vision-language model that provides interpretable, step-by-step assessments of human actions, combining structured reasoning with reinforcement learning to improve accuracy in action evaluation tasks.

Contribution

The paper introduces HieroAction, a novel model that integrates stepwise action reasoning and hierarchical policy learning for fine-grained, interpretable action assessment.

Findings

01. Outperforms existing methods on multiple benchmark datasets.

02. Provides structured, interpretable evaluations of human actions.

03. Enhances scoring accuracy through reinforcement learning.

Abstract

Evaluating human actions with clear and detailed feedback is important in areas such as sports, healthcare, and robotics, where decisions rely not only on final outcomes but also on interpretable reasoning. However, most existing methods provide only a final score without explanation or detailed analysis, which limits their practical applicability. To address this, we introduce HieroAction, a vision-language model that delivers accurate and structured assessments of human actions. HieroAction builds on two key ideas: (1) Stepwise Action Reasoning, a chain-of-thought process tailored to action assessment, which guides the model to evaluate actions step by step, from overall recognition through sub-action analysis to final scoring, thus enhancing interpretability and structured understanding; and (2) Hierarchical Policy Learning, a reinforcement learning strategy that…
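The three stages named in the abstract (overall recognition, sub-action analysis, final scoring) suggest a structured output record. The following is only a minimal sketch of what such a record could look like, not the paper's actual output format: the class names `SubActionNote` and `StepwiseAssessment`, their fields, and the `render` helper are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubActionNote:
    # Hypothetical fields: the abstract does not say what is recorded per sub-action.
    name: str
    comment: str


@dataclass
class StepwiseAssessment:
    # Stage 1: overall recognition of the action.
    action: str
    # Stage 2: one note per sub-action, in temporal order.
    sub_actions: List[SubActionNote] = field(default_factory=list)
    # Stage 3: final score; None until a score has been assigned.
    score: Optional[float] = None


def render(assessment: StepwiseAssessment) -> str:
    """Format an assessment as a plain-text report, one stage after another."""
    lines = [f"Action: {assessment.action}"]
    for i, note in enumerate(assessment.sub_actions, start=1):
        lines.append(f"  {i}. {note.name}: {note.comment}")
    score = "n/a" if assessment.score is None else f"{assessment.score:.1f}"
    lines.append(f"Score: {score}")
    return "\n".join(lines)
```

Keeping the stages as separate fields, rather than one free-text string, makes it possible to inspect or validate each stage on its own, which is one plausible reading of "structured" assessment.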
