Interpretable Embeddings of Speech Enhance and Explain Brain Encoding Performance of Audio Models
Riki Shimizu, Richard J. Antonello, Chandan Singh, Nima Mesgarani

TL;DR
This study demonstrates that the alignment of speech foundation models (SFMs) with brain responses is driven primarily by their encoding of interpretable speech features, and that combining these features with SFM embeddings improves both the interpretability and the performance of brain encoding models.
Contribution
The paper introduces a method to interpret SFM representations using explicit speech features and shows how this improves understanding of brain encoding.
Findings
SFMs' brain alignment is mainly due to simple speech features.
SFMs show a trade-off between low-level and high-level feature encoding.
SFMs learn brain-relevant semantics that grow with model size and context.
Abstract
Speech foundation models (SFMs) are increasingly hailed as powerful computational models of human speech perception. However, since their representations are inherently black-box, it remains unclear what drives their alignment with brain responses. To remedy this, we built linear encoding models from six interpretable feature families (mel-spectrogram, Gabor filter bank, speech presence, phonetic, syntactic, and semantic features) and from contextualized embeddings of three state-of-the-art SFMs (Whisper, HuBERT, WavLM), quantifying the electrocorticography (ECoG) response variance shared between feature classes. Variance-partitioning analyses revealed several key insights: First, the SFMs' alignment with the brain can be mostly explained by their ability to learn and encode simple interpretable speech features. Second, SFMs exhibit a systematic trade-off between encoding of…
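The variance-partitioning idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes two feature spaces, A (for example, interpretable speech features) and B (for example, SFM embeddings), each used to predict one response channel with ridge regression. The function names, the closed-form ridge solver, the fixed regularization strength, and the single train/test split are all assumptions made for illustration; the excerpt does not specify the authors' actual pipeline.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Fit closed-form ridge regression on centered data and predict X_test.

    Hypothetical helper; alpha is an assumed fixed regularization strength.
    """
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean(axis=0)
    Xc = X_train - x_mean
    yc = y_train - y_mean
    # Solve (X^T X + alpha * I) w = X^T y for the weight vector w.
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    return (X_test - x_mean) @ w + y_mean


def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def variance_partition(A_train, B_train, y_train, A_test, B_test, y_test, alpha=1.0):
    """Split explained variance into unique-to-A, unique-to-B, and shared parts.

    Uses three models: A alone, B alone, and the concatenation [A, B].
    """
    r2_a = r2_score(y_test, ridge_fit_predict(A_train, y_train, A_test, alpha))
    r2_b = r2_score(y_test, ridge_fit_predict(B_train, y_train, B_test, alpha))
    AB_train = np.hstack([A_train, B_train])
    AB_test = np.hstack([A_test, B_test])
    r2_ab = r2_score(y_test, ridge_fit_predict(AB_train, y_train, AB_test, alpha))
    return {
        "unique_a": r2_ab - r2_b,        # explained only when A is included
        "unique_b": r2_ab - r2_a,        # explained only when B is included
        "shared": r2_a + r2_b - r2_ab,   # explained by either space alone
        "joint": r2_ab,
    }
```

By construction, the three parts sum to the joint model's R². Under this reading, the abstract's first finding corresponds to a large shared component and a small unique component for the SFM embeddings. Held-out estimates can be slightly negative because of noise, so in practice such values are usually treated as zero or corrected.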
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Hearing Loss and Rehabilitation
