Bringing Interpretability to Neural Audio Codecs

Samir Sadok; Julien Hauret; Éric Bavu

arXiv:2506.04492·eess.AS·September 19, 2025·Interspeech

TL;DR

This paper investigates how neural audio codecs encode speech attributes such as content, identity, and pitch, and proposes a post-hoc method for extracting those attributes directly from the codec tokens.

Contribution

The paper introduces a two-step approach: an analysis stage that examines how speech information is encoded in neural audio codec tokens, followed by a synthesis stage that trains a post-hoc explanation network (AnCoGen) on those tokens.

Findings

01. Insight into speech attribute encoding within codec tokens

02. A novel post-hoc explanation network for neural audio codecs

03. Enhanced interpretability of neural audio representations

Abstract

Neural audio codecs have grown in popularity because of their potential for efficiently modeling audio with transformers. These codecs convert a continuous waveform into discrete units at a low sampling rate. In contrast to semantic units, acoustic units may lack interpretability because their training objectives focus primarily on reconstruction performance. This paper proposes a two-step approach to explore how speech information is encoded within the codec tokens. The analysis stage aims to gain deeper insight into how speech attributes such as content, identity, and pitch are encoded. The synthesis stage then trains an AnCoGen network for post-hoc explanation of codecs, extracting speech attributes directly from the respective tokens.
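The abstract does not say how the analysis stage measures what the tokens encode. As a generic illustration only, and not the authors' method, one common way to probe discrete tokens is to estimate the mutual information between token indices and a quantized attribute such as pitch. The sketch below uses synthetic data; the function name `mutual_information`, the two "codebooks", and the pitch bins are all hypothetical stand-ins.

```python
import numpy as np

def mutual_information(tokens, labels):
    """Estimate I(tokens; labels) in bits from two paired integer sequences."""
    tokens = np.asarray(tokens)
    labels = np.asarray(labels)
    # Joint count table: rows are token ids, columns are attribute bins.
    joint = np.zeros((tokens.max() + 1, labels.max() + 1))
    np.add.at(joint, (tokens, labels), 1)
    joint /= joint.sum()
    # Marginals, kept 2-D so their product has the same shape as the joint table.
    p_tok = joint.sum(axis=1, keepdims=True)
    p_lab = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    ratio = joint[nonzero] / (p_tok @ p_lab)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

rng = np.random.default_rng(0)
n_frames = 5000

# Synthetic stand-in for a per-frame pitch track quantized into 8 bins.
pitch_bins = rng.integers(0, 8, size=n_frames)

# Hypothetical codebook A: the token id is fully determined by the pitch bin.
tokens_a = pitch_bins * 2
# Hypothetical codebook B: token ids are drawn independently of pitch.
tokens_b = rng.integers(0, 16, size=n_frames)

mi_a = mutual_information(tokens_a, pitch_bins)
mi_b = mutual_information(tokens_b, pitch_bins)
print(f"codebook A: {mi_a:.3f} bits, codebook B: {mi_b:.3f} bits")
```

In this toy setup, codebook A should score close to the entropy of the pitch bins (about 3 bits) and codebook B close to zero. With a real codec, the token sequences would come from the codec's quantizer and the attribute labels from an external estimator, neither of which is specified on this page.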
