Discovering and Steering Interpretable Concepts in Large Generative Music Models
Nikhil Singh, Manuel Cherep, Pattie Maes

TL;DR
This paper introduces a scalable method using sparse autoencoders to discover and steer interpretable concepts in large autoregressive music models, revealing both known and novel musical patterns.
Contribution
It presents a novel approach for extracting interpretable features from transformer-based music models and demonstrates how these concepts can be used to control model outputs.
Findings
Revealed both traditional and novel musical concepts in models
Developed scalable automated labeling and validation pipelines
Showed concepts can be used to steer model generations
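As a rough illustration of what steering with a discovered concept can look like, the sketch below adds a scaled feature direction to every position of a residual-stream activation. This is a common recipe in the SAE literature, not necessarily the authors' exact procedure; the function name `steer`, the strength parameter `alpha`, and the use of a unit-normalized direction are all assumptions made for illustration.

```python
import math


def steer(residual, direction, alpha):
    """Add alpha * unit(direction) to each position of a residual activation.

    residual:  list of per-position activation vectors (hypothetical layout).
    direction: one feature direction, e.g. an SAE decoder column (assumption).
    alpha:     steering strength; 0 leaves the activations unchanged.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Normalize so that alpha has the same meaning for every feature.
    unit = [d / norm for d in direction]
    return [[h + alpha * u for h, u in zip(position, unit)]
            for position in residual]


# Toy usage: two positions in a 2-dimensional "residual stream".
activations = [[0.0, 0.0], [1.0, 1.0]]
steered = steer(activations, direction=[3.0, 4.0], alpha=5.0)
```

In a real model the edited activations would be written back at the chosen layer during generation (for example through a forward hook), and the effect would then be judged on the generated audio.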
Abstract
The fidelity with which neural networks can now generate content such as music presents a scientific opportunity: these systems appear to have learned implicit theories of such content's structure through statistical learning alone. This offers a potentially new lens on theories of human-generated media. When internal representations align with traditional constructs (e.g. chord progressions in music), they show how such categories can emerge from statistical regularities; when they diverge, they expose limits of existing frameworks and patterns we may have overlooked but that nonetheless carry explanatory power. In this paper, focusing on autoregressive music generators, we introduce a method for discovering interpretable concepts using sparse autoencoders (SAEs), extracting interpretable features from the residual stream of a transformer model. We make this approach scalable and…
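To make the SAE idea concrete, here is a minimal, dependency-free sketch of a sparse autoencoder applied to a single residual-stream vector. It assumes a generic ReLU encoder with an L1 sparsity penalty; the abstract does not state the exact SAE variant, loss, or hyperparameters, so the class name, the `expansion` and `l1_coeff` parameters, and the initialization are illustrative assumptions only.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE: x -> sparse features z -> reconstruction x_hat (assumed form)."""

    def __init__(self, d_model, expansion, seed=0):
        rng = random.Random(seed)
        d_sae = d_model * expansion  # dictionary size = width * expansion factor
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        # Decoder starts as the transpose of the encoder (a common choice).
        self.W_dec = [[self.W_enc[j][i] for j in range(d_sae)]
                      for i in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        centered = [a - b for a, b in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]  # ReLU keeps features non-negative

    def decode(self, z):
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        z = self.encode(x)
        x_hat = self.decode(z)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # z is non-negative, so its sum equals its L1 norm.
        return mse + l1_coeff * sum(z)


# Toy usage on one 4-dimensional activation with an 8x expansion factor.
sae = SparseAutoencoder(d_model=4, expansion=8)
features = sae.encode([0.5, -1.0, 0.25, 2.0])
```

In practice the weights would be trained by gradient descent on activations collected from the music model, and each learned feature (one entry of `features`) becomes a candidate concept to label and validate.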
Peer Reviews
Decision: ICLR 2026 Poster
1. This is the first paper to perform concept discovery with SAEs in pretrained music generators, and the results are much better than those of previous probing work. 2. The design choices in Sec. 3.3 provide very useful insights for future researchers in concept discovery for pretrained audio/music models. 3. The automatic labeling pipeline is promising (with some limitations, as stated under weaknesses) and could be applied to other models (audio generation/understanding models). In general, thi
1. In Sec. 3.5 (automated interpretability), the automatic pipeline might harm the interpretation of some concepts (i.e., chords and keys), since neither Gemini nor Essentia has that ability. 2. Currently the number of examples is very limited. More examples or case studies would be useful in the appendix, including possible failure cases where the automatic labeling pipeline fails to reach a conclusion. I.e., more examples where: (1) A concept could be successfully extracted and named; (2) A concept could be extr
To the best of my knowledge, this is the first attempt to apply SAE-based interpretability to audio generation LMs; this is a meaningful extension beyond NLP/vision that addresses an underexplored modality. The modular pipeline is well-documented and reproducible. The authors' systematic exploration across model scales, layers, sparsity levels, and expansion factors provides useful data on where interpretable structure emerges in music LMs. Table 1 and Fig. 3 offer valuable design guidan
(1) I notice CLAP serves as both the filter for accepting labels and the metric for evaluating interpretability and steering success. This creates a validation loop where the authors are essentially measuring "does this feature change CLAP scores" rather than "does this feature control meaningful musical concepts". The human study is too limited to break this circularity. (2) 15–35% success rate with a single prompt and CLAP-only evaluation is insufficient to claim "robust" controll
This paper engages with the interpretability literature to provide a lens into the relationship between a highly complicated model and the sort of language we use as humans to describe music. This provides much-needed research at the intersection of model interpretability and music generation. The authors successfully determine that they can use SAEs to uncover important information about how music data is structured, including not only the fact that there are such labels, but that they are in some
1. I am unconvinced by the contributions of this paper. Fundamentally, the claim that they are producing concepts we "don't have a name for" seems blatantly false given that they use a multimodal model to label them (unless I'm misunderstanding something). 2. The numbers for the steering example don't seem impressive without more information - having "any improvement at all" on 20-25% of features compared to None could be noise. 3. For a paper about music, it is very limiting to take the "m
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
Methods: Focus · ALIGN · Sparse Evolutionary Training
