Bringing Interpretability to Neural Audio Codecs
Samir Sadok, Julien Hauret, Éric Bavu

TL;DR
This paper investigates how neural audio codecs encode speech attributes and proposes a method to interpret these encodings, making the codecs' internal representations more transparent.
Contribution
It introduces a two-step approach combining analysis and a post-hoc explanation network to interpret speech information in neural audio codec tokens.
Findings
Insight into speech attribute encoding within codec tokens
A novel post-hoc explanation network for neural audio codecs
Enhanced interpretability of neural audio representations
Abstract
Neural audio codecs have grown in popularity because of their potential for efficiently modeling audio with transformers. These codecs convert a continuous waveform into discrete units at a low sampling rate. In contrast to semantic units, acoustic units may lack interpretability because their training objectives primarily focus on reconstruction performance. This paper proposes a two-step approach to explore how speech information is encoded within the codec tokens. The analysis stage aims to gain deeper insight into how speech attributes such as content, identity, and pitch are encoded. The synthesis stage then trains an AnCoGen network as a post-hoc explainer of codecs, extracting speech attributes directly from the corresponding tokens.
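To make the idea of the analysis stage concrete, here is a minimal sketch of one common way to ask which codec tokens carry a given speech attribute: estimate the mutual information between the tokens of each codebook level and a quantized attribute label. This is an illustration under stated assumptions, not necessarily the analysis used in the paper. The token streams are synthetic toy data, and the names `mutual_information`, `pitch_bin`, `level0`, and `level1` are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


random.seed(0)
n_frames = 5000

# Assumption: a per-frame pitch label quantized into 8 bins (toy data).
pitch_bin = [random.randrange(8) for _ in range(n_frames)]

# Assumption: two toy codebook levels. Level 0 is a noisy copy of the pitch
# bin; level 1 is drawn independently of pitch.
level0 = [p if random.random() < 0.9 else random.randrange(8) for p in pitch_bin]
level1 = [random.randrange(8) for _ in range(n_frames)]

# A high value suggests that level's tokens carry pitch information.
mi_by_level = [mutual_information(level, pitch_bin) for level in (level0, level1)]
print(mi_by_level)
```

With real data, the toy sequences would be replaced by token indices from an actual codec and by attribute labels from an external estimator. A level with markedly higher mutual information than the others would be a candidate location for that attribute.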
