Do Foundational Audio Encoders Understand Music Structure?

Keisuke Toyama; Zhi Zhong; Akira Takahashi; Shusuke Takahashi; Yuki Mitsufuji

arXiv:2512.17209 · cs.SD · January 30, 2026

TL;DR

This paper investigates whether pretrained foundational audio encoders (FAEs) capture music structure. Across 11 types of FAEs, those trained with self-supervised masked language modeling on music data prove particularly effective for music structure analysis (MSA).

Contribution

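The paper provides a comprehensive evaluation of 11 types of FAEs for music structure analysis, examining how learning method, training data, and model context length affect performance, and highlights the effectiveness of self-supervised masked language modeling.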

Findings

01

FAEs trained with self-supervised masked language modeling on music data are particularly effective for MSA.

02

Training data and model context length significantly influence MSA performance.

03

Before this study, only a small subset of FAEs had been examined for MSA.

Abstract

In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored: only a small subset of FAEs has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in FAE and MSA.

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies