What Do Speech Foundation Models Not Learn About Speech?
Abdul Waheed, Hanin Atwany, Bhiksha Raj, Rita Singh

TL;DR
This paper investigates which non-verbal cues speech foundation models learn, how those cues are represented across layers, and how well the representations adapt to downstream tasks, yielding insights into the models' generalization and layer-wise behavior.
Contribution
It provides a comprehensive analysis of how well multiple speech models capture non-verbal cues, and of how their layer-wise representations behave, in both zero-shot and fine-tuning settings.
Findings
Some models perform well zero-shot despite not being explicitly trained for non-verbal tasks.
Zero-shot performance correlates with the quality of learned representations.
Layer-wise analysis shows a convex relationship between representation separability and model depth.
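One way to make "representation separability" per layer concrete is sketched below. This summary does not say which separability measure the paper uses, so the Fisher-style ratio of between-class to within-class scatter is an assumption, and the names `fisher_separability` and `layerwise_separability` are hypothetical. The input is assumed to be one (utterances, dim) feature matrix per layer, with one class label per utterance.

```python
import numpy as np

def fisher_separability(features, labels):
    """Between-class scatter divided by within-class scatter (trace form).

    Assumed metric for illustration only; higher means classes are
    easier to tell apart in this feature space.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        # Squared distance of the class mean from the global mean, weighted by class size.
        between += len(class_feats) * np.sum((class_mean - overall_mean) ** 2)
        # Squared distance of each sample from its own class mean.
        within += np.sum((class_feats - class_mean) ** 2)
    return between / within

def layerwise_separability(layer_features, labels):
    """Score every layer; layer_features is a list of (utterances, dim) arrays."""
    return [fisher_separability(f, labels) for f in layer_features]
```

Plotting the returned list against layer index would show how the score changes with depth for a given model and task.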
Abstract
Understanding how speech foundation models capture non-verbal cues is crucial for improving their interpretability and adaptability across diverse tasks. In our work, we analyze several prominent models, such as Whisper, Seamless, Wav2Vec, HuBERT, and Qwen2-Audio, focusing on their learned representations in both paralinguistic and non-paralinguistic tasks from the Dynamic-SUPERB benchmark. Our study addresses three key questions: (1) What non-verbal cues (e.g., speaker intent, emotion, environmental context) are captured? (2) How are these cues represented across different layers of the models? and (3) To what extent can these representations be effectively adapted to downstream tasks? To answer these questions, we first evaluate the models in a zero-shot setting, followed by fine-tuning on layer-wise features extracted from these models. Our results provide insights into the models'…
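The second stage mentioned in the abstract, training on features taken from individual layers, can be illustrated with a small probe. The abstract does not specify the classifier or the pooling, so everything here is an assumption: frame-level hidden states are mean-pooled into one vector per utterance, and a softmax-regression probe is fit with plain gradient descent. The names `mean_pool`, `train_linear_probe`, and `probe_accuracy` are hypothetical, and real features would come from a model's hidden states rather than from this code.

```python
import numpy as np

def mean_pool(hidden_states):
    """Average a (frames, dim) array over time to get one vector per utterance."""
    return np.asarray(hidden_states, dtype=float).mean(axis=0)

def train_linear_probe(X, y, num_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression probe with full-batch gradient descent.

    X: (utterances, dim) pooled features from a single layer.
    y: integer class labels in [0, num_classes).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    """Fraction of utterances whose highest-scoring class matches the label."""
    preds = (np.asarray(X, dtype=float) @ W + b).argmax(axis=1)
    return float((preds == np.asarray(y)).mean())
```

Repeating this for each layer's features, with a held-out split for evaluation, gives one accuracy per layer that can be compared against the zero-shot result for the same task.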
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
