S2Cap: A Benchmark and a Baseline for Singing Style Captioning
Hyunjong Ok, Jaeho Lee

TL;DR
This paper introduces S2Cap, a comprehensive dataset for singing style captioning, and proposes a simple baseline algorithm, addressing the lack of detailed singing voice datasets for downstream tasks.
Contribution
The paper presents S2Cap, a new dataset with detailed singing voice descriptions, and a baseline algorithm for singing style captioning, filling a key gap in the field.
Findings
The S2Cap dataset covers diverse vocal, acoustic, and demographic attributes.
The baseline algorithm establishes initial reference performance on singing style captioning.
The dataset's public availability facilitates future research in singing voice analysis.
Abstract
Singing voices contain much richer information than common voices, including varied vocal and acoustic properties. However, current open-source audio-text datasets for singing voices capture only a narrow range of attributes and lack acoustic features, leading to limited utility towards downstream tasks, such as style captioning. To fill this gap, we formally define the singing style captioning task and present S2Cap, a dataset of singing voices with detailed descriptions covering diverse vocal, acoustic, and demographic characteristics. Using this dataset, we develop an efficient and straightforward baseline algorithm for singing style captioning. The dataset is available at https://zenodo.org/records/15673764.
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Diverse Musicological Studies
Methods: Focus · Sparse Evolutionary Training
