Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information

Shih-Heng Wang; Tiantian Feng; Aditya Kommineni; Thanathai Lertpetchpun; Bowen Yi; Xuan Shi; Shrikanth Narayanan

arXiv:2603.18359·cs.SD·March 20, 2026

TL;DR

This paper introduces a framework using Sparse Autoencoders to interpret how neural audio codecs encode accent information, revealing differences based on codec design and bitrate.

Contribution

The paper proposes a novel interpretability framework for neural audio codecs (NACs) and offers insights into how they encode accent information.

Findings

01

DAC and SpeechTokenizer achieve the highest interpretability.

02

Acoustic-oriented NACs encode accent in activation magnitudes.

03

Phonetic-oriented NACs rely on activation positions.
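
The distinction between activation magnitudes and activation positions in findings 02 and 03 can be illustrated with a small sketch. This is not the paper's procedure: the helper names `position_view` and `magnitude_view` and the toy activation matrix are hypothetical, chosen only to show one way of separating "which units fire" from "how strongly they fire" in a sparse code.

```python
import numpy as np

def position_view(z):
    # Hypothetical helper: keep only WHICH units are active (binary mask),
    # discarding how large each activation is.
    return (z > 0).astype(np.float64)

def magnitude_view(z):
    # Hypothetical helper: keep only HOW LARGE the activations are,
    # sorted in descending order per frame so unit identity is discarded.
    return -np.sort(-z, axis=1)

# Toy sparse activations: 2 frames, 4 SAE units (illustrative values only).
z = np.array([[0.0, 2.0, 0.0, 0.5],
              [1.5, 0.0, 0.0, 0.5]])

pos = position_view(z)   # which units fire in each frame
mag = magnitude_view(z)  # activation strengths, without unit indices
```

Under this toy split, a probe trained on `pos` alone would see only positional information, while a probe trained on `mag` alone would see only magnitudes.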

Abstract

Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute, accent, and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation…
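
As a rough illustration of how a sparse autoencoder turns a dense representation into sparse activations, the following is a minimal sketch of a top-k SAE forward pass in NumPy. The abstract does not specify the SAE variant, so the top-k sparsity rule, the function name `topk_sae_forward`, the dimensions, and the random weights are all assumptions made for illustration, not the authors' configuration.

```python
import numpy as np

def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode dense frames x (n, d) into k-sparse codes (n, m), then reconstruct."""
    # ReLU pre-activations of the encoder (assumed architecture).
    pre = np.maximum((x - b_dec) @ w_enc + b_enc, 0.0)
    # Keep the k largest activations per frame and zero out the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    # Linear decoder maps the sparse code back to the dense space.
    x_hat = z @ w_dec + b_dec
    return z, x_hat

# Stand-in data: 5 frames of an 8-dim "codec" representation (random, illustrative).
rng = np.random.default_rng(0)
d, m, k = 8, 32, 4
x = rng.normal(size=(5, d))
w_enc = rng.normal(size=(d, m)) * 0.1   # untrained weights, for shape checking only
w_dec = w_enc.T.copy()
b_enc = np.zeros(m)
b_dec = np.zeros(d)

z, x_hat = topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k)
```

In practice the weights would be trained to minimize reconstruction error on codec representations; here they are random, so only the shapes and the sparsity pattern of `z` are meaningful.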

Taxonomy

Topics: Speech Recognition and Synthesis · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis