Interpretable Embeddings of Speech Enhance and Explain Brain Encoding Performance of Audio Models

Riki Shimizu; Richard J. Antonello; Chandan Singh; Nima Mesgarani

arXiv:2507.16080 · q-bio.NC · September 26, 2025

TL;DR

This study shows that the alignment between speech foundation models (SFMs) and brain responses is driven primarily by the models' encoding of interpretable speech features, and that combining these features with SFM embeddings improves both the interpretability and the performance of brain encoding models.

Contribution

The paper introduces a variance-partitioning approach that interprets SFM representations in terms of explicit speech features (acoustic, phonetic, syntactic, and semantic) and shows how this clarifies what drives brain encoding performance.

Findings

01. SFMs' brain alignment is mainly due to simple speech features.

02. SFMs show a trade-off between low-level and high-level feature encoding.

03. SFMs learn brain-relevant semantics that grow with model size and context.

Abstract

Speech foundation models (SFMs) are increasingly hailed as powerful computational models of human speech perception. However, since their representations are inherently black-box, it remains unclear what drives their alignment with brain responses. To remedy this, we built linear encoding models from six interpretable feature families: mel-spectrogram, Gabor filter bank features, speech presence, phonetic, syntactic, and semantic features, and contextualized embeddings from three state-of-the-art SFMs (Whisper, HuBERT, WavLM), quantifying electrocorticography (ECoG) response variance shared between feature classes. Variance-partitioning analyses revealed several key insights: First, the SFMs' alignment with the brain can be mostly explained by their ability to learn and encode simple interpretable speech features. Second, SFMs exhibit a systematic trade-off between encoding of…
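
To make the variance-partitioning idea in the abstract concrete, here is a minimal sketch in Python for two feature spaces. It is not the authors' pipeline: the synthetic data, the fixed ridge penalty `alpha`, the single train/test split, and the helper names (`ridge_fit`, `heldout_r2`, `partition_variance`) are all assumptions chosen for illustration. Real ECoG encoding models typically add time-lagged features and cross-validated regularization.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge weights (no intercept; inputs assumed roughly centered)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def heldout_r2(X_train, y_train, X_test, y_test, alpha=1.0):
    """Fit on the training split, return R^2 on the held-out split."""
    w = ridge_fit(X_train, y_train, alpha)
    resid = y_test - X_test @ w
    total = y_test - y_test.mean()
    return 1.0 - np.sum(resid ** 2) / np.sum(total ** 2)

def partition_variance(A, B, y, n_train, alpha=1.0):
    """Split explained variance into unique-to-A, unique-to-B, and shared parts."""
    def r2(X):
        return heldout_r2(X[:n_train], y[:n_train], X[n_train:], y[n_train:], alpha)
    r2_a, r2_b = r2(A), r2(B)
    r2_ab = r2(np.hstack([A, B]))  # joint model on concatenated features
    return {
        "unique_A": r2_ab - r2_b,
        "unique_B": r2_ab - r2_a,
        "shared": r2_a + r2_b - r2_ab,
        "joint": r2_ab,
    }

# Synthetic stand-ins (assumption): A plays the role of an SFM embedding,
# B an interpretable feature set, y a single electrode's response.
rng = np.random.default_rng(0)
n, n_train = 2000, 1500
common = rng.standard_normal((n, 3))   # information present in both spaces
a_only = rng.standard_normal((n, 3))   # information only in A
b_only = rng.standard_normal((n, 5))   # information only in B (unrelated to y)
A = np.hstack([common, a_only])
B = np.hstack([common, b_only])
y = common.sum(axis=1) + a_only[:, 0] + 0.5 * rng.standard_normal(n)

parts = partition_variance(A, B, y, n_train)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup most of the explained variance should land in the shared component, with a smaller part unique to `A` and essentially none unique to `B`. That pattern is only an illustration of how a "mostly explained by interpretable features" result would show up in such an analysis; it is not a reproduction of the paper's numbers.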

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Hearing Loss and Rehabilitation