Unsupervised Evaluation of Deep Audio Embeddings for Music Structure Analysis

Axel Marmoret

arXiv:2603.27218·cs.SD·March 31, 2026

TL;DR

This paper evaluates nine pre-trained deep audio models for music structure analysis (MSA) using unsupervised segmentation. It finds that modern embeddings generally, though not systematically, outperform spectrogram baselines, and it proposes improved evaluation standards.

Contribution

It introduces an unsupervised evaluation framework for deep audio embeddings in MSA, compares three unsupervised segmentation algorithms (Foote's checkerboard kernels, spectral clustering, and Correlation Block-Matching), and emphasizes the need for more rigorous metrics.

Findings

01

Deep embeddings often outperform spectrogram baselines.

02

The Correlation Block-Matching (CBM) segmentation algorithm is consistently effective.

03

Unsupervised boundary estimation surpasses linear probing baselines.

Abstract

Music Structure Analysis (MSA) aims to uncover the high-level organization of musical pieces. State-of-the-art methods are often based on supervised deep learning, but these methods are bottlenecked by the need for heavily annotated data and inherent structural ambiguities. In this paper, we propose an unsupervised evaluation of nine open-source, generic pre-trained deep audio models on MSA. For each model, we extract barwise embeddings and segment them using three unsupervised segmentation algorithms (Foote's checkerboard kernels, spectral clustering, and Correlation Block-Matching (CBM)), focusing exclusively on boundary retrieval. Our results demonstrate that modern, generic deep embeddings generally outperform traditional spectrogram-based baselines, but not systematically. Furthermore, our unsupervised boundary estimation methodology generally yields stronger performance than…
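To make the segmentation step more concrete, here is a minimal sketch of the first algorithm named in the abstract, Foote's checkerboard kernel, applied to a matrix of barwise embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names (checkerboard_kernel, foote_novelty, pick_peaks), the cosine self-similarity, the untapered kernel, the kernel half-size of 4 bars, and the naive peak picking are all choices made here for the example.

```python
import numpy as np


def checkerboard_kernel(half_size):
    """Untapered checkerboard kernel of shape (2*half_size, 2*half_size).

    The two diagonal quadrants are +1 and the two off-diagonal quadrants
    are -1. (Hypothetical helper; a Gaussian taper is often added.)
    """
    signs = np.concatenate([-np.ones(half_size), np.ones(half_size)])
    return np.outer(signs, signs)


def foote_novelty(embeddings, half_size=4):
    """Novelty curve over bars from a (n_bars, dim) embedding matrix.

    Assumes cosine similarity between bars; novelty[i] is high when the
    bars before i and the bars from i onward are each homogeneous but
    dissimilar to one another, i.e. a segment likely starts at bar i.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    ssm = unit @ unit.T  # barwise self-similarity matrix

    kernel = checkerboard_kernel(half_size)
    # Zero-pad so the kernel can be centered on every bar, including edges.
    padded = np.pad(ssm, half_size, mode="constant")

    n_bars = ssm.shape[0]
    novelty = np.zeros(n_bars)
    for i in range(n_bars):
        window = padded[i:i + 2 * half_size, i:i + 2 * half_size]
        novelty[i] = np.sum(window * kernel)
    return novelty


def pick_peaks(novelty, min_height=0.0):
    """Naive peak picking: interior local maxima above min_height."""
    return [
        i for i in range(1, len(novelty) - 1)
        if novelty[i] > novelty[i - 1]
        and novelty[i] >= novelty[i + 1]
        and novelty[i] > min_height
    ]


# Toy input: 8 bars of one "section" followed by 8 bars of another.
toy = np.vstack([np.tile([1.0, 0.0], (8, 1)), np.tile([0.0, 1.0], (8, 1))])
toy_novelty = foote_novelty(toy, half_size=4)
toy_boundaries = pick_peaks(toy_novelty)
```

On the toy input, the novelty curve peaks at the bar where the second section starts. In practice, the rows of the input would be the barwise embeddings produced by one of the pre-trained models, and the kernel size and peak-picking rule would need tuning.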
