HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng; Junchao Huang; Xiangsheng Huang; Di Wen; Junwei Zheng; Yufan Chen; Kailun Yang; Jiamin Wu; Chongqing Hao; Rainer Stiefelhagen

arXiv:2506.09650 · cs.CV · October 6, 2025

TL;DR

This paper introduces HopaDIFF, a diffusion-based framework for referring human action segmentation in multi-person videos, together with a new dataset, RHAS133. Both address a limitation of existing methods, which focus on single-person activities.

Contribution

The work pioneers multi-person action segmentation guided by textual references, introduces a new dataset, RHAS133, and proposes a Fourier-conditioned diffusion model with a cross-input gate for improved reasoning.

Findings

01. HopaDIFF outperforms existing methods on RHAS133

02. The new dataset RHAS133 contains 137 fine-grained actions in 133 movies

03. Benchmarking shows limited performance of existing methods on multi-person scenarios

Abstract

Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, RHAS133, built from 133 movies and annotated with 137 fine-grained actions over 33 hours of video, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To…
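As a rough illustration of the task definition in the abstract (partition a video into segments and give each one a label), the sketch below collapses per-frame labels into labelled segments. It is not the HopaDIFF method. The function name `frames_to_segments` and the action labels are hypothetical and are not taken from RHAS133. The per-frame labels are assumed to belong to the one person picked out by the textual reference.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) segments.

    `start` is inclusive and `end` is exclusive, both in frame indices.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical per-frame predictions for the referred person (illustrative labels only).
frame_labels = ["walk", "walk", "sit", "sit", "sit", "talk"]
segments = frames_to_segments(frame_labels)
print(segments)
```

In the referring setting, the same video can yield a different segmentation for each person in it, depending on which person the text describes.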

Code & Models

Repositories

kpeng9510/hopadiff · PyTorch · Official

Videos

HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · Attentive Walk-Aggregating Graph Neural Network