ConceptCaps: a Distilled Concept Dataset for Interpretability in Music Models

Bruno Sienkiewicz; Łukasz Neumann; Mateusz Modrzejewski

arXiv:2601.14157·cs.SD·February 5, 2026

TL;DR

ConceptCaps is a music dataset with explicit concept labels, built to give concept-based interpretability methods such as TCAV the clean, well-separated examples they need to analyze music models.

Contribution

We introduce ConceptCaps, a large-scale music dataset with explicit concept labels and a novel pipeline separating semantic modeling from text and audio synthesis.

Findings

01. TCAV analysis confirms meaningful concept recovery
02. Audio-text alignment shows high coherence
03. Linguistic metrics indicate high-quality descriptions

Abstract

Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 21k music-caption-tags triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
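
To make concrete why TCAV depends on clean positive and negative sets, here is a minimal sketch, not the authors' code. It assumes hypothetical records that carry a tag list and a layer activation, splits them by one concept, fits a small logistic-regression probe to obtain a concept activation vector (CAV), and computes a TCAV-style score as the fraction of examples whose class logit increases along that direction. The activations, the toy `tanh` head, the concept name "piano", and every function name are illustrative assumptions; nothing here is drawn from ConceptCaps or MusicGen.

```python
import numpy as np


def split_by_concept(records, concept):
    """Partition records into positives (tag present) and negatives (tag absent)."""
    pos = [r for r in records if concept in r["tags"]]
    neg = [r for r in records if concept not in r["tags"]]
    return pos, neg


def fit_cav(pos_acts, neg_acts, steps=500, lr=0.1):
    """Fit a logistic-regression probe and return its unit-norm weight vector."""
    X = np.vstack([pos_acts, neg_acts])
    y = np.concatenate([np.ones(len(pos_acts)), np.zeros(len(neg_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept present)
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on the weights
        b -= lr * float(np.mean(p - y))          # gradient step on the bias
    return w / np.linalg.norm(w)


def tcav_score(logit_grads, cav):
    """Fraction of examples whose directional derivative along the CAV is positive."""
    return float(np.mean(logit_grads @ cav > 0))


# Synthetic stand-in data (assumption): the concept shifts activations along axis 0.
rng = np.random.default_rng(0)
dim = 8
records = []
for i in range(200):
    has_concept = i % 2 == 0
    act = rng.normal(size=dim)
    if has_concept:
        act[0] += 2.0
    tags = ["piano", "calm"] if has_concept else ["calm"]
    records.append({"tags": tags, "act": act})

pos, neg = split_by_concept(records, "piano")
pos_acts = np.array([r["act"] for r in pos])
neg_acts = np.array([r["act"] for r in neg])
cav = fit_cav(pos_acts, neg_acts)

# Toy downstream head (assumption): logit(h) = u . tanh(h), so d logit / d h = u * (1 - tanh(h)^2).
u = np.zeros(dim)
u[0] = 1.0
all_acts = np.vstack([pos_acts, neg_acts])
logit_grads = u * (1.0 - np.tanh(all_acts) ** 2)
score = tcav_score(logit_grads, cav)
print("CAV component along the concept axis:", round(float(cav[0]), 3))
print("TCAV score:", score)
```

If the tags were noisy, so that many "negative" records actually contained the concept, the probe would have less signal to separate the two sets and the resulting direction would be less reliable. That is the failure mode the explicit labels are meant to avoid.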

Code & Models

Datasets

bsienkiewicz/ConceptCaps (dataset, 300 downloads)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Music and Audio Processing · Topic Modeling