Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow

TL;DR
This paper introduces a framework that uses sparse autoencoders and linear mappings to interpret and control audio generative models by linking latent features to human-understandable acoustic concepts like pitch and timbre.
Contribution
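A method for interpreting audio latent spaces: sparse autoencoders are trained on audio autoencoder latents, and their features are linearly mapped to discretized acoustic properties (pitch, amplitude, and timbre), enabling controllable synthesis and analysis of the generation process.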
Findings
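Mapped SAE features to discretized pitch, amplitude, and timbre via linear mappings
Validated the approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces
Enabled controllable manipulation of generated audio
Analyzed how acoustic properties emerge during synthesis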
Abstract
While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art…
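The two-stage pipeline described in the abstract (an SAE trained on audio autoencoder latents, followed by a linear mapping from SAE features to discretized acoustic properties) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the uniform binning, and the softmax-regression fit are all assumptions, and every function name and hyperparameter here is hypothetical.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode latents into non-negative features, then reconstruct them.

    Assumption: a single-layer ReLU encoder and a linear decoder.
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)  # SAE feature activations
    x_hat = f @ W_dec + b_dec               # reconstruction of the latents
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

def discretize(values, n_bins, lo, hi):
    """Map a continuous property (e.g. pitch) to bin indices 0..n_bins-1.

    Assumption: uniform bins over [lo, hi].
    """
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]  # interior bin edges
    return np.digitize(values, edges)

def fit_linear_map(features, labels, n_classes, lr=0.1, steps=200):
    """Fit a linear map from SAE features to class bins.

    Assumption: softmax regression trained with plain gradient descent.
    """
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p = p / p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

Under these assumptions, each column of the learned matrix `W` indicates which SAE features push a frame toward a given pitch, amplitude, or timbre bin, which is the kind of correspondence that makes feature-level manipulation possible.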
