HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios
Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

TL;DR
This paper introduces HopaDIFF, a diffusion-based framework for referring human action segmentation in multi-person videos, together with a new dataset, RHAS133, to address the limitations of existing methods that focus on single-person activities.
Contribution
The work pioneers multi-person action segmentation guided by textual references, introduces a new dataset, RHAS133, and proposes a Fourier-conditioned diffusion model with a cross-input gate for improved reasoning.
Findings
HopaDIFF outperforms existing methods on RHAS133
The new dataset RHAS133 contains 137 fine-grained actions in 133 movies
Benchmarking shows limited performance of existing methods on multi-person scenarios
Abstract
Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions across 33 hours of video, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To…
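To make the task definition in the abstract concrete, the following is a minimal sketch of how per-frame labels for one referred person could be collapsed into labeled segments. It is not taken from the paper: the function name frames_to_segments, the toy label strings, and the half-open (start, end) convention are all assumptions chosen for illustration, and the labels are not drawn from RHAS133.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (start, end, label) segments.

    `end` is exclusive, so a segment covers frames start .. end - 1.
    This is an illustrative helper, not part of HopaDIFF.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((start, start + length, label))
        start += length
    return segments


# Hypothetical per-frame predictions for the person named by the text reference.
frame_labels = ["walk", "walk", "sit", "sit", "sit", "talk"]
segments = frames_to_segments(frame_labels)
print(segments)
```

In the referring setting described above, one such label sequence would be produced per textual reference, so two people in the same video can yield different segmentations.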
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion · Attentive Walk-Aggregating Graph Neural Network
