AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration

Henry Zhong; Jörg M. Buchholz; Julian Maclaren; Simon Carlile; Richard F. Lyon

arXiv:2602.19409·cs.SD·February 24, 2026

Open Access · 1 dataset

TL;DR

AuditoryHuM introduces an unsupervised, human-MLLM collaborative framework for generating and clustering auditory scene labels, reducing manual effort and enabling scalable, edge-deployable scene recognition models.

Contribution

The paper presents a novel collaborative approach combining MLLMs and human input for automatic auditory scene label discovery and clustering, improving scalability and label quality.

Findings

01 Effective label generation across diverse datasets

02 Improved clustering cohesion with thematic balance

03 Facilitates training of lightweight scene recognition models

Abstract

Manual annotation of audio datasets is labour-intensive, and it is challenging to balance label granularity with acoustic separability. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. By leveraging MLLMs (Gemma and Qwen), the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot learning techniques (Human-CLAP) to quantify the alignment between generated text labels and raw audio content. A strategically targeted human-in-the-loop intervention is then used to refine the least aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion and…
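
The targeted human-in-the-loop step described in the abstract can be illustrated with a minimal sketch. It assumes that each audio clip and its generated label have already been embedded into a shared space and that alignment is measured by cosine similarity; the abstract does not state the exact metric, so this is an assumption. The function names `cosine_similarity` and `least_aligned` and the toy vectors are hypothetical and are not taken from the paper or from Human-CLAP.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def least_aligned(audio_embs, text_embs, k):
    """Return the k (index, score) pairs with the lowest audio-label alignment.

    Hypothetical helper: the lowest-scoring pairs are the ones a human
    reviewer would be asked to inspect and relabel.
    """
    scores = [cosine_similarity(a, t) for a, t in zip(audio_embs, text_embs)]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return [(i, scores[i]) for i in order[:k]]


# Toy embeddings standing in for real audio and label embeddings (assumed data).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.9]]

# Pair 1 is orthogonal (audio and label disagree), so it is flagged first.
flagged = least_aligned(audio, text, k=1)
```

Ranking by score and reviewing only the bottom k pairs keeps manual effort proportional to k rather than to the dataset size, which appears to be the motivation for targeting the intervention.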

Datasets

hzhongresearch/auditoryhum_supplementary
dataset · 28 dl

Taxonomy

Topics: Music and Audio Processing · Machine Learning and Data Classification · Speech Recognition and Synthesis