Metric Analysis for Spatial Semantic Segmentation of Sound Scenes

Mayank Mishra; Paul Magron; Romain Serizel

arXiv:2511.07075 · cs.SD · February 27, 2026

Open Access

TL;DR

This paper introduces CASA-SDR, a metric for spatial semantic segmentation of sound scenes (S5) that matches estimated sources to references before scoring classification, so that separation quality is not conflated with labeling errors.

Contribution

The paper proposes CASA-SDR, a joint metric for spatial semantic segmentation of sound scenes that performs permutation-invariant source matching before computing classification errors, making the evaluation separation-focused rather than classification-focused.

Findings

01 CASA-SDR reduces penalization of label swaps compared to CA-SDR.

02 CASA-SDR provides more interpretable separation assessments.

03 Analysis shows CA-SDR can conflate separation and classification errors.

Abstract

Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. Evaluating S5 systems with separation and classification metrics individually makes system comparison difficult, whereas existing joint metrics, such as the class-aware signal-to-distortion ratio (CA-SDR), can conflate separation and labeling errors. In particular, CA-SDR relies on predicted class labels for source matching, which may obscure label swaps or misclassifications when the underlying source estimates remain perceptually correct. In this work, we introduce the class and source-aware signal-to-distortion ratio (CASA-SDR), a new metric that performs permutation-invariant source matching before computing classification errors, thereby shifting from a classification-focused approach to a separation-focused…
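The difference in matching order described in the abstract can be illustrated with a toy sketch. This is not the paper's definition of CA-SDR or CASA-SDR: it assumes equal numbers of references and estimates, uses a plain SDR without any invariances, and searches permutations exhaustively. The function names (`sdr_db`, `label_based_matching`, `permutation_invariant_matching`) and the toy signals are hypothetical, chosen only for illustration.

```python
import math
from itertools import permutations


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB (assumed form, no invariances)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def label_based_matching(ref_labels, est_labels):
    """Pair each reference with the estimate that carries the same predicted label."""
    return {
        i: (est_labels.index(label) if label in est_labels else None)
        for i, label in enumerate(ref_labels)
    }


def permutation_invariant_matching(references, estimates):
    """Pair references and estimates so that the summed SDR is maximal."""
    n = len(references)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sdr_db(references[i], estimates[p[i]]) for i in range(n)),
    )
    return dict(enumerate(best))


# Toy scene: both sources are separated well, but their labels are swapped.
references = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
ref_labels = ["dog", "speech"]
estimates = [[0.9, 0.0, 1.1, 0.0], [0.0, 1.0, 0.1, -0.9]]
est_labels = ["speech", "dog"]  # swapped predictions

# Label-first matching pairs each reference with the wrong waveform.
label_pairs = label_based_matching(ref_labels, est_labels)
label_sdrs = [sdr_db(references[i], estimates[j]) for i, j in label_pairs.items()]

# Signal-first matching recovers the correct pairing; labels are checked afterwards.
pi_pairs = permutation_invariant_matching(references, estimates)
pi_sdrs = [sdr_db(references[i], estimates[j]) for i, j in pi_pairs.items()]
label_errors = sum(ref_labels[i] != est_labels[j] for i, j in pi_pairs.items())

print("label-first SDRs (dB):", [round(s, 1) for s in label_sdrs])
print("signal-first SDRs (dB):", [round(s, 1) for s in pi_sdrs])
print("label errors after signal-first matching:", label_errors)
```

In this toy case the label-first pairing yields negative SDRs even though both estimates are close to their true sources, while the signal-first pairing reports high SDRs and counts the two label errors separately. That is the kind of conflation the abstract attributes to label-based source matching.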

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis