ConceptCaps: a Distilled Concept Dataset for Interpretability in Music Models
Bruno Sienkiewicz, Łukasz Neumann, Mateusz Modrzejewski

TL;DR
ConceptCaps is a new music dataset with explicit concept labels, enabling better interpretability and analysis of music models through improved semantic separation and controllability.
Contribution
We introduce ConceptCaps, a large-scale music dataset with explicit concept labels and a novel pipeline separating semantic modeling from text and audio synthesis.
Findings
TCAV analysis confirms meaningful concept recovery
Audio-text alignment shows high coherence
Linguistic metrics indicate high-quality descriptions
Abstract
Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 21k music-caption-tags triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
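To make the TCAV requirement concrete, here is a minimal sketch of how a concept probe can be scored once clean positive and negative examples exist for a concept. It is not the authors' code and makes several assumptions: the activations are synthetic, the head `logit(h) = sum_i w_i * tanh(h_i)` is a toy stand-in for a music model, and the concept direction is taken as the difference of class means rather than the normal of a trained linear classifier, which is what TCAV normally uses. All function names are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_activation_vector(pos, neg):
    """Unit vector from the negative-set mean to the positive-set mean.

    Simplification (assumption): TCAV usually fits a linear classifier and
    uses its normal vector; the mean difference is a cheaper stand-in.
    """
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def toy_logit_gradient(h, w):
    """Gradient of the toy head logit(h) = sum_i w_i * tanh(h_i) w.r.t. h."""
    return [wi * (1.0 - math.tanh(hi) ** 2) for hi, wi in zip(h, w)]


def tcav_score(activations, w, cav):
    """Fraction of inputs whose logit increases along the concept direction."""
    positive = 0
    for h in activations:
        grad = toy_logit_gradient(h, w)
        directional_derivative = sum(g * c for g, c in zip(grad, cav))
        if directional_derivative > 0:
            positive += 1
    return positive / len(activations)


rng = random.Random(0)
dim = 4

# Synthetic activations: concept-positive clips are shifted along dimension 0.
pos = [[rng.gauss(2.0, 0.5)] + [rng.gauss(0.0, 0.5) for _ in range(dim - 1)]
       for _ in range(50)]
neg = [[rng.gauss(0.0, 0.5)] + [rng.gauss(0.0, 0.5) for _ in range(dim - 1)]
       for _ in range(50)]
cav = concept_activation_vector(pos, neg)

# Toy heads that read only dimension 0, with opposite signs.
w_aligned = [1.0, 0.0, 0.0, 0.0]
w_opposed = [-1.0, 0.0, 0.0, 0.0]

test_inputs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(100)]
score_aligned = tcav_score(test_inputs, w_aligned, cav)
score_opposed = tcav_score(test_inputs, w_opposed, cav)
```

In this toy setup the aligned head is sensitive to the concept direction for every input and the opposed head for none, so the two scores sit at the extremes by construction. With sparse or noisy tags the positive and negative sets overlap, the estimated direction becomes unreliable, and the score loses its meaning, which is the gap that explicit per-concept labels are meant to close.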
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Music and Audio Processing · Topic Modeling
