To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions
Ju-Chiang Wang, Yun-Ning Hung, Jordan B. L. Smith

TL;DR
This paper presents a multi-task deep learning framework using a spectral-temporal Transformer model to identify musical structural functions like verse and chorus directly from audio, outperforming existing methods.
Contribution
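It introduces a 7-class taxonomy for song segments (intro, verse, chorus, bridge, outro, instrumental, and silence), provides rules to consolidate annotations from four disparate datasets, and employs a spectral-temporal Transformer-based model, SpecTNT, that can be trained with an additional connectionist temporal localization (CTL) loss.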
Findings
Outperforms state-of-the-art chorus detection methods.
Achieves strong boundary detection results.
Effective cross-dataset generalization.
Abstract
Conventional music structure analysis algorithms aim to divide a song into segments and to group them with abstract labels (e.g., 'A', 'B', and 'C'). However, explicitly identifying the function of each segment (e.g., 'verse' or 'chorus') is rarely attempted, despite its many applications. We introduce a multi-task deep learning framework to model these structural semantic labels directly from audio by estimating "verseness," "chorusness," and so forth, as a function of time. We propose a 7-class taxonomy (i.e., intro, verse, chorus, bridge, outro, instrumental, and silence) and provide rules to consolidate annotations from four disparate datasets. We also propose to use a spectral-temporal Transformer-based model, called SpecTNT, which can be trained with an additional connectionist temporal localization (CTL) loss. In cross-dataset evaluations using four public datasets, we demonstrate…
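To make the idea of "verseness" or "chorusness" as a function of time more concrete, the sketch below turns segment annotations into one frame-level 0/1 curve per class, which is one plausible form of training target for such a model. Only the seven class names come from the abstract. The raw-label mapping in `RAW_TO_CLASS`, the function names, and the 0.5-second hop are hypothetical choices made for illustration and are not the consolidation rules or settings used in the paper.

```python
# The 7 class names are taken from the abstract; everything else is illustrative.
CLASSES = ["intro", "verse", "chorus", "bridge", "outro", "instrumental", "silence"]

# Hypothetical consolidation rules (raw dataset label -> taxonomy class).
# The paper's actual rules are not reproduced here.
RAW_TO_CLASS = {
    "refrain": "chorus",
    "solo": "instrumental",
    "interlude": "instrumental",
    "fade-out": "outro",
}


def consolidate(raw_label):
    """Map a raw annotation label onto one of the 7 taxonomy classes."""
    label = raw_label.strip().lower()
    label = RAW_TO_CLASS.get(label, label)
    if label not in CLASSES:
        raise ValueError(f"no consolidation rule for label {raw_label!r}")
    return label


def activation_curves(segments, duration, hop=0.5):
    """Build one 0/1 curve per class from (start, end, label) segments.

    Times are in seconds; `hop` is an assumed frame spacing in seconds.
    """
    n_frames = int(round(duration / hop))
    curves = {name: [0.0] * n_frames for name in CLASSES}
    for start, end, raw_label in segments:
        name = consolidate(raw_label)
        first = max(0, int(round(start / hop)))
        last = min(n_frames, int(round(end / hop)))
        for i in range(first, last):
            curves[name][i] = 1.0
    return curves


# Toy annotation for a 6-second excerpt (made-up data).
segments = [(0.0, 2.0, "Intro"), (2.0, 4.0, "verse"), (4.0, 6.0, "Refrain")]
curves = activation_curves(segments, duration=6.0, hop=0.5)
```

With non-overlapping segments that cover the whole excerpt, exactly one class is active in each frame. A model following the framing in the abstract would predict such curves from audio rather than read them from annotations.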
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
