SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Anuradha Chopra; Abhinaba Roy; Dorien Herremans

arXiv:2506.15154·cs.SD·June 19, 2025

TL;DR

SonicVerse is a multi-task learning model that generates detailed music captions by integrating music feature detection, enhancing descriptive accuracy for both short and long music pieces.

Contribution

Introduces a novel projection-based multi-task architecture that combines captioning with auxiliary music feature detection, improving descriptive quality in music AI.

Findings

01 Enhanced caption detail and accuracy with feature integration

02 Effective chaining for long music descriptions using LLMs

03 Extended MusicBench dataset with music feature annotations
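
The chaining idea in finding 02 can be illustrated with a minimal sketch. It assumes that a long piece is cut into fixed-length segments, that each segment is captioned separately, and that the timestamped captions are then assembled into a single prompt for an LLM. The segment length, the prompt wording, and the function names (`split_into_segments`, `format_timestamp`, `build_chaining_prompt`) are hypothetical and are not taken from the paper or its repository.

```python
def split_into_segments(duration_s, segment_s=10.0):
    """Return (start, end) windows in seconds that cover the whole piece."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)  # last window may be shorter
        segments.append((start, end))
        start = end
    return segments


def format_timestamp(seconds):
    """Format a time in seconds as M:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def build_chaining_prompt(segments, captions):
    """Assemble timestamped segment captions into one LLM prompt (assumed wording)."""
    if len(segments) != len(captions):
        raise ValueError("need exactly one caption per segment")
    lines = [
        "Combine these segment captions into one description of the whole "
        "piece, noting how the music changes over time."
    ]
    for (start, end), caption in zip(segments, captions):
        lines.append(f"[{format_timestamp(start)}-{format_timestamp(end)}] {caption}")
    return "\n".join(lines)
```

In this sketch the per-segment captions would come from the captioning model and the returned prompt would be passed to whichever LLM performs the chaining; both steps are left out because the page does not describe them.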

Abstract

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details and high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed…
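
The data flow described in the abstract can be sketched in a few lines of plain Python. This is only a shape-level illustration under stated assumptions, not the authors' implementation: the class name `MultiTaskProjector`, the use of single linear layers, the random weights, all dimensions, and the class counts per head (for example 24 key classes) are hypothetical. It shows an audio embedding being projected into language-space tokens, auxiliary heads predicting music features from the same embedding, and the head outputs being projected into additional tokens that are appended to the captioning input.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for a learned linear layer (rows x cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight):
    """Multiply vector x (length = rows of weight) by the weight matrix."""
    cols = len(weight[0])
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(cols)]


class MultiTaskProjector:
    """Hypothetical sketch: content projector plus auxiliary feature heads."""

    def __init__(self, audio_dim, llm_dim, n_content_tokens, head_sizes, seed=0):
        rng = random.Random(seed)
        self.llm_dim = llm_dim
        self.n_content_tokens = n_content_tokens
        # Content path: audio embedding -> n_content_tokens language-space tokens.
        self.content_w = make_matrix(audio_dim, n_content_tokens * llm_dim, rng)
        # Auxiliary heads: audio embedding -> logits for each music feature.
        self.heads = {
            name: make_matrix(audio_dim, n_classes, rng)
            for name, n_classes in head_sizes.items()
        }
        # Feature projectors: head logits -> one language-space token per feature.
        self.feature_proj = {
            name: make_matrix(n_classes, llm_dim, rng)
            for name, n_classes in head_sizes.items()
        }

    def forward(self, audio_embedding):
        flat = linear(audio_embedding, self.content_w)
        d = self.llm_dim
        content_tokens = [flat[i * d:(i + 1) * d] for i in range(self.n_content_tokens)]
        feature_logits = {}
        feature_tokens = []
        for name, head_w in self.heads.items():
            logits = linear(audio_embedding, head_w)
            feature_logits[name] = logits  # would receive an auxiliary loss in training
            feature_tokens.append(linear(logits, self.feature_proj[name]))
        # Captioning input: content tokens followed by feature tokens.
        return content_tokens + feature_tokens, feature_logits


model = MultiTaskProjector(
    audio_dim=8, llm_dim=4, n_content_tokens=3,
    head_sizes={"key": 24, "vocals": 2},  # assumed class counts
)
tokens, logits = model.forward([0.5] * 8)
print(len(tokens), len(tokens[0]), sorted(logits))
```

In a multi-task setup of this kind, the caption loss and the auxiliary feature losses would typically be optimized together, so that the feature tokens carry information the captioner can use; how the paper weights or schedules those losses is not stated on this page.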

Code & Models

Repositories

amaai-lab/sonicverse (PyTorch, official)

Models

amaai-lab/SonicVerse (Hugging Face model; 37 downloads, 19 likes)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization