Unsupervised Evaluation of Deep Audio Embeddings for Music Structure Analysis
Axel Marmoret

TL;DR
This paper evaluates nine pre-trained deep audio models for music structure analysis using unsupervised segmentation, highlighting the effectiveness of modern embeddings and proposing improved evaluation standards.
Contribution
It introduces an unsupervised evaluation framework for deep audio embeddings in MSA and compares multiple segmentation algorithms, emphasizing the need for more rigorous metrics.
Findings
Deep embeddings generally outperform spectrogram baselines, though not systematically.
The CBM segmentation algorithm is consistently effective.
Unsupervised boundary estimation generally outperforms linear probing baselines.
Abstract
Music Structure Analysis (MSA) aims to uncover the high-level organization of musical pieces. State-of-the-art methods are often based on supervised deep learning, but these methods are bottlenecked by the need for heavily annotated data and inherent structural ambiguities. In this paper, we propose an unsupervised evaluation of nine open-source, generic pre-trained deep audio models on MSA. For each model, we extract barwise embeddings and segment them using three unsupervised segmentation algorithms (Foote's checkerboard kernels, spectral clustering, and Correlation Block-Matching (CBM)), focusing exclusively on boundary retrieval. Our results demonstrate that modern, generic deep embeddings generally outperform traditional spectrogram-based baselines, but not systematically. Furthermore, our unsupervised boundary estimation methodology generally yields stronger performance than…
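To make the segmentation step more concrete, here is a minimal sketch of a checkerboard-kernel novelty curve in the spirit of Foote's method, applied to barwise embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names (`checkerboard_kernel`, `novelty_curve`), the use of cosine similarity, the untapered kernel, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np


def checkerboard_kernel(half_size):
    """Untapered checkerboard kernel: +1 on the diagonal quadrants, -1 elsewhere."""
    kernel = np.ones((2 * half_size, 2 * half_size))
    kernel[:half_size, half_size:] = -1.0
    kernel[half_size:, :half_size] = -1.0
    return kernel


def novelty_curve(embeddings, half_size=2):
    """Slide a checkerboard kernel along the diagonal of a cosine self-similarity matrix.

    embeddings: array of shape (n_bars, dim), one row per bar (assumed layout).
    Returns one novelty value per bar; a high value at index i suggests that
    bar i starts a new segment.
    """
    # Cosine self-similarity between bars (an assumption; other similarities are possible).
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    ssm = unit @ unit.T

    n_bars = ssm.shape[0]
    kernel = checkerboard_kernel(half_size)
    # Zero-pad so the kernel can be evaluated at the first and last bars.
    padded = np.pad(ssm, half_size, mode="constant")

    novelty = np.zeros(n_bars)
    for i in range(n_bars):
        # Window covers original bars i - half_size .. i + half_size - 1.
        window = padded[i:i + 2 * half_size, i:i + 2 * half_size]
        novelty[i] = np.sum(window * kernel)
    return novelty


# Toy input (hypothetical): four bars of one section, then four bars of another.
toy = np.vstack([np.tile([1.0, 0.0], (4, 1)), np.tile([0.0, 1.0], (4, 1))])
curve = novelty_curve(toy, half_size=2)
boundary_bar = int(np.argmax(curve))  # bar index where the second section starts
```

In practice, boundaries would be taken from the local maxima of the novelty curve rather than a single global maximum, and the kernel size would be tuned to the expected segment length. Zero-padding also inflates the novelty at the very first bar, so edge values are usually ignored.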
