Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

Nathan Paek; Yongyi Zang; Qihui Yang; Randal Leistikow

arXiv:2510.23802·cs.LG·October 31, 2025

TL;DR

This paper introduces a framework that uses sparse autoencoders and linear mappings to interpret and control audio generative models by linking latent features to human-understandable acoustic concepts like pitch and timbre.

Contribution

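The paper presents a method for interpreting audio latent spaces: SAEs are trained on audio autoencoder latents, and linear mappings then tie the SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This supports both controllable manipulation of generated audio and analysis of the generation process.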

Findings

01 SAE features were mapped to discretized acoustic properties (pitch, amplitude, and timbre) through linear mappings.

02 The mappings enabled controllable manipulation of generated audio.

03 The mappings were used to analyze how acoustic properties emerge during synthesis.

Abstract

While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art…
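
The two-stage pipeline described in the abstract (an SAE trained on latents, then a linear mapping from SAE features to a discretized acoustic property) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the latents are synthetic stand-ins for audio autoencoder latents, the SAE objective (mean squared error plus an L1 penalty on ReLU features) is an assumed common choice, the linear mapping is fit by least squares onto one-hot bin labels, and all function names, sizes, and hyperparameters are hypothetical.

```python
import numpy as np

def discretize(values, n_bins, lo, hi):
    """Map continuous values in [lo, hi] to integer bin ids 0..n_bins-1."""
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]  # interior bin edges only
    return np.digitize(values, edges)

def train_sae(X, n_features, l1=1e-3, lr=0.01, steps=200, seed=0):
    """Full-batch gradient descent on MSE + L1 for a one-layer ReLU SAE (assumed objective)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_e = rng.normal(0.0, 0.1, (d, n_features))  # encoder weights
    W_d = W_e.T.copy()                           # decoder starts as the encoder transpose
    b_e = np.zeros(n_features)
    b_d = np.zeros(d)
    for _ in range(steps):
        pre = X @ W_e + b_e
        f = np.maximum(pre, 0.0)                 # sparse, non-negative features
        err = f @ W_d + b_d - X                  # reconstruction error
        d_xhat = 2.0 * err / n                   # gradient of the mean squared error
        # Backpropagate through the decoder, add the L1 term, gate by the ReLU mask.
        d_pre = (d_xhat @ W_d.T + l1 / n) * (pre > 0)
        W_d -= lr * (f.T @ d_xhat)
        b_d -= lr * d_xhat.sum(axis=0)
        W_e -= lr * (X.T @ d_pre)
        b_e -= lr * d_pre.sum(axis=0)
    return W_e, b_e, W_d, b_d

def encode(X, W_e, b_e):
    """SAE feature activations for a batch of latents."""
    return np.maximum(X @ W_e + b_e, 0.0)

def fit_linear_map(F, labels, n_bins):
    """Least-squares linear map (with bias column) from SAE features to one-hot bins."""
    F_aug = np.hstack([F, np.ones((len(F), 1))])
    Y = np.eye(n_bins)[labels]
    W, *_ = np.linalg.lstsq(F_aug, Y, rcond=None)
    return W

def predict_bins(F, W):
    """Predicted bin id per sample: argmax over the linear map's outputs."""
    F_aug = np.hstack([F, np.ones((len(F), 1))])
    return (F_aug @ W).argmax(axis=1)

# Synthetic stand-in data: 16-dim "latents" that vary along one direction with pitch.
rng = np.random.default_rng(0)
pitch = rng.uniform(0.0, 1.0, 512)               # normalized pitch, hypothetical
direction = rng.normal(size=16)
latents = np.outer(pitch, direction) + 0.1 * rng.normal(size=(512, 16))
labels = discretize(pitch, 8, 0.0, 1.0)          # 8 pitch bins

W_e, b_e, W_d, b_d = train_sae(latents, n_features=64)
features = encode(latents, W_e, b_e)
W_map = fit_linear_map(features, labels, n_bins=8)
pred = predict_bins(features, W_map)
print("train accuracy:", (pred == labels).mean())
```

In a setup like this, a column of `W_map` indicates which SAE features push toward a given pitch bin, which is the kind of link between features and acoustic concepts the abstract describes. Amplitude and timbre would need their own discretizations and mappings.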
