SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Anuradha Chopra, Abhinaba Roy, Dorien Herremans

TL;DR
SonicVerse is a multi-task learning model that generates detailed music captions by integrating music feature detection, enhancing descriptive accuracy for both short and long music pieces.
Contribution
Introduces a novel projection-based multi-task architecture that combines captioning with auxiliary music feature detection, improving descriptive quality in music AI.
Findings
Enhanced caption detail and accuracy with feature integration
Effective chaining for long music descriptions using LLMs
Extended MusicBench dataset with music feature annotations
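The chaining finding can be illustrated with a minimal sketch. It assumes, as one plausible reading of the summary rather than the authors' documented procedure, that a long track is split into fixed-length segments, each segment is captioned separately, and the timestamped captions are handed to an LLM as a single prompt. The names `segment_bounds`, `format_timestamp`, and `build_chaining_prompt`, the 10-second segment length, and the prompt wording are all hypothetical; no LLM is called here.

```python
def format_timestamp(seconds):
    """Render a time in seconds as m:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def segment_bounds(duration, segment_len=10.0):
    """Split [0, duration) into consecutive (start, end) windows.

    The 10-second default is an illustrative assumption.
    """
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + segment_len, duration)
        bounds.append((start, end))
        start = end
    return bounds


def build_chaining_prompt(segment_captions):
    """Build one LLM prompt from ((start, end), caption) pairs.

    The instruction text is a placeholder, not the paper's prompt.
    """
    lines = [
        f"[{format_timestamp(start)}-{format_timestamp(end)}] {caption}"
        for (start, end), caption in segment_captions
    ]
    header = (
        "Combine the following timestamped captions into one "
        "time-informed description of the whole piece:"
    )
    return header + "\n" + "\n".join(lines)


# Usage with made-up captions standing in for per-segment model output.
bounds = segment_bounds(25.0)
captions = ["soft piano intro", "drums enter", "vocals fade out"]
prompt = build_chaining_prompt(list(zip(bounds, captions)))
print(prompt)
```

The resulting string would be sent to whichever LLM performs the final merge; that step is left out because the summary does not specify it.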
Abstract
Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, to directly capture both low-level acoustic details and high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed…
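The projection idea in the abstract can be sketched in a few lines of NumPy. This is a toy forward pass under stated assumptions, not the authors' implementation: the weights are random and untrained, every dimension is invented, the heads are single linear layers, and the 24-class key head (12 tonics, major or minor) is an illustrative choice. The names `build_caption_input`, `audio_proj`, `heads`, and `feature_proj` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the paper.
AUDIO_DIM = 32        # size of a pooled audio embedding
TOKEN_DIM = 16        # embedding size expected by the language model
N_AUDIO_TOKENS = 4    # language tokens produced from the audio embedding
FEATURES = {"key": 24, "vocals": 2}  # auxiliary task -> number of classes


def linear(in_dim, out_dim):
    """Return random (weights, bias) for an untrained linear layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


audio_proj = linear(AUDIO_DIM, N_AUDIO_TOKENS * TOKEN_DIM)
heads = {name: linear(AUDIO_DIM, n) for name, n in FEATURES.items()}
feature_proj = {name: linear(n, TOKEN_DIM) for name, n in FEATURES.items()}


def build_caption_input(audio_embedding):
    """Project audio and auxiliary-head outputs into language-token space."""
    # Main path: audio embedding -> a few language tokens.
    w, b = audio_proj
    audio_tokens = (audio_embedding @ w + b).reshape(N_AUDIO_TOKENS, TOKEN_DIM)

    feature_probs = {}
    feature_tokens = []
    for name in FEATURES:
        # Auxiliary head: predict a class distribution for this feature.
        wh, bh = heads[name]
        probs = softmax(audio_embedding @ wh + bh)
        feature_probs[name] = probs
        # Project the head output into one extra language token.
        wp, bp = feature_proj[name]
        feature_tokens.append(probs @ wp + bp)

    # The captioning input is the audio tokens plus one token per feature.
    tokens = np.vstack([audio_tokens, np.stack(feature_tokens)])
    return tokens, feature_probs


tokens, probs = build_caption_input(rng.normal(size=AUDIO_DIM))
print(tokens.shape)  # (6, 16): 4 audio tokens + 2 feature tokens
```

In a multi-task setup of this kind, the head outputs would typically also receive their own classification losses alongside the captioning loss; the sketch shows only how the token sequence is assembled.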
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization
