Temporal Adaptation of Pre-trained Foundation Models for Music Structure Analysis
Yixiao Zhang, Haonan Chen, Ju-Chiang Wang, Jitong Chen

TL;DR
This paper introduces a temporal adaptation method for fine-tuning pre-trained music foundation models. It makes music structure analysis of full-length songs efficient and improves both boundary detection and structural function prediction.
Contribution
It proposes a temporal adaptation approach that enables analysis of a full-length song in a single forward pass by combining audio window extension with low-resolution adaptation.
Findings
Improved boundary detection accuracy
Enhanced structural function prediction
Maintained inference speed and memory efficiency
Abstract
Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a temporal adaptation approach for fine-tuning music foundation models tailored to MSA. Our method enables efficient analysis of full-length songs in a single forward pass by incorporating two key strategies: (1) audio window extension and (2) low-resolution adaptation. Experiments on the Harmonix Set and RWC-Pop datasets show that our method significantly improves both boundary detection and structural…
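To make the idea of low-resolution adaptation concrete, here is a minimal sketch of how lowering the temporal feature resolution shrinks the sequence a model must process for a full-length song. The abstract does not say how the resolution is lowered, so the average pooling used here, the helper names `num_frames` and `avg_pool_time`, and every number (a 240-second song, a 50 Hz frame rate, a pooling factor of 10) are illustrative assumptions, not the authors' method or settings.

```python
import math


def num_frames(duration_s, frame_rate_hz):
    """Number of feature frames needed to cover a clip at a given frame rate."""
    return math.ceil(duration_s * frame_rate_hz)


def avg_pool_time(frames, factor):
    """Average-pool a list of feature vectors along the time axis.

    Assumption: average pooling stands in for whatever mechanism the
    paper actually uses. A final group shorter than `factor` is averaged
    over its actual length.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Hypothetical settings, chosen only for illustration.
song_s = 240.0        # a four-minute song
high_rate_hz = 50.0   # assumed high-resolution frame rate
pool_factor = 10      # assumed temporal down-sampling factor

high_res = [[float(i)] for i in range(num_frames(song_s, high_rate_hz))]
low_res = avg_pool_time(high_res, pool_factor)
print(len(high_res), len(low_res))
```

Under these assumed numbers, the sequence for the whole song drops from 12,000 frames to 1,200, which is the kind of reduction that makes a single forward pass over a full track plausible.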
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
