Hierarchical Activity Recognition and Captioning from Long-Form Audio

Peng Zhang; Qingyu Luo; Philip J.B. Jackson; Wenwu Wang

arXiv:2602.06765 · cs.SD · February 9, 2026

Open Access

TL;DR

This paper introduces MultiAct, a comprehensive dataset and benchmark for hierarchical activity recognition and captioning in long-form audio, addressing the limitations of prior short-clip focused work.

Contribution

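The paper presents a new dataset of long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries, along with a unified hierarchical model for structured understanding of long-duration audio.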

Findings

01 Established strong baseline results on MultiAct

02 Identified key challenges in modeling hierarchical audio structures

03 Highlighted future directions for capturing long-range relationships

Abstract

Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited…
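
To make the three-level annotation scheme described in the abstract more concrete, the following is a minimal sketch of how one recording's annotations might be represented as a tree. The class name `Segment`, its fields, the helper `labels_at_level`, and the example labels, timestamps and captions are all hypothetical illustrations; they are not taken from the MultiAct release, whose actual file format is not described here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Segment:
    """One annotated span of audio (hypothetical schema, not the MultiAct format)."""
    label: str
    start: float          # seconds from the start of the recording
    end: float            # seconds from the start of the recording
    caption: str = ""     # optional free-text description of the span
    children: List["Segment"] = field(default_factory=list)

    def flatten(self, depth: int = 0) -> Iterator[Tuple[int, "Segment"]]:
        """Yield (depth, segment) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.flatten(depth + 1)


def labels_at_level(root: Segment, level: int) -> List[str]:
    """Collect labels at one depth: 0 = activity, 1 = sub-activity, 2 = event."""
    return [seg.label for depth, seg in root.flatten() if depth == level]


# Invented example: one activity, two sub-activities, four events.
recording = Segment(
    "make tea", 0.0, 95.0,
    caption="A person makes a cup of tea.",
    children=[
        Segment("boil water", 0.0, 60.0, children=[
            Segment("tap running", 2.0, 10.0, caption="Water runs from a tap."),
            Segment("kettle boiling", 12.0, 58.0, caption="A kettle heats up and boils."),
        ]),
        Segment("pour and stir", 60.0, 95.0, children=[
            Segment("pouring", 62.0, 70.0),
            Segment("spoon clinking", 75.0, 82.0),
        ]),
    ],
)

for depth, seg in recording.flatten():
    print("  " * depth + f"{seg.label} [{seg.start:.0f}s to {seg.end:.0f}s]")
```

In this sketch, depth in the tree stands in for the semantic level, so classification, detection and captioning targets at each level could be read off with a single traversal. Whether the dataset itself stores annotations as a tree or as separate per-level tracks is an open assumption.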

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Emotion and Mood Recognition