Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Yonatan Belinkov, James Glass

TL;DR
This paper analyzes the speech representations learned by a deep end-to-end speech recognition model trained with a CTC loss. Frame-level features from different layers are given to a separate classifier that predicts phone labels, which indicates how much phonetic information each layer encodes.
Contribution
The paper provides a layer-by-layer analysis of the speech representations in an end-to-end model, measuring how well the features from each layer support phone classification.
Findings
The quality of a representation for phone prediction depends on the depth of the layer it is taken from.
Deeper layers tend to encode more abstract speech features.
Model complexity influences the interpretability of learned representations.
Abstract
Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and…
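The analysis setup described in the abstract (features from a fixed pre-trained model, a separate classifier trained to label each frame with a phone, and a comparison across layers) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are synthetic stand-ins generated by a hypothetical `synthetic_layer` helper, the layer names `layer_a` and `layer_b` are invented, and the classifier is a plain linear softmax probe chosen for brevity, since this excerpt does not specify the classifier's architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, frame):
    """Linear score for each phone class given one frame's feature vector."""
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, biases)]


def train_probe(frames, phones, num_classes, epochs=100, lr=0.1):
    """Fit a linear softmax classifier by full-batch gradient descent.

    The features are fixed inputs: only the probe's own weights are
    updated, never the model that produced the features.
    """
    dim = len(frames[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for frame, phone in zip(frames, phones):
            probs = softmax(class_scores(weights, biases, frame))
            for k in range(num_classes):
                err = probs[k] - (1.0 if k == phone else 0.0)
                grad_b[k] += err
                for j in range(dim):
                    grad_w[k][j] += err * frame[j]
        for k in range(num_classes):
            biases[k] -= lr * grad_b[k] / len(frames)
            for j in range(dim):
                weights[k][j] -= lr * grad_w[k][j] / len(frames)
    return weights, biases


def frame_accuracy(weights, biases, frames, phones):
    """Fraction of frames whose highest-scoring class is the true phone."""
    correct = 0
    for frame, phone in zip(frames, phones):
        scores = class_scores(weights, biases, frame)
        if scores.index(max(scores)) == phone:
            correct += 1
    return correct / len(frames)


def synthetic_layer(phones, dim, signal, rng):
    """Hypothetical stand-in for one layer's frame-level features.

    Each frame is Gaussian noise; `signal` is added to the dimension
    indexed by the frame's phone label, so signal=0 carries no phone
    information at all. Requires dim >= number of phone classes.
    """
    frames = []
    for phone in phones:
        frame = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        frame[phone] += signal
        frames.append(frame)
    return frames


def probe_layers(seed=0, num_classes=3, dim=4, n_train=150, n_test=60):
    """Train one probe per (synthetic) layer and report held-out accuracy."""
    rng = random.Random(seed)
    phones = [rng.randrange(num_classes) for _ in range(n_train + n_test)]
    layers = {
        "layer_a": synthetic_layer(phones, dim, 3.0, rng),  # informative
        "layer_b": synthetic_layer(phones, dim, 0.0, rng),  # pure noise
    }
    results = {}
    for name, frames in layers.items():
        w, b = train_probe(frames[:n_train], phones[:n_train], num_classes)
        results[name] = frame_accuracy(
            w, b, frames[n_train:], phones[n_train:])
    return results


if __name__ == "__main__":
    for name, acc in probe_layers().items():
        print(f"{name}: held-out frame accuracy = {acc:.2f}")
```

In a real experiment the synthetic features would be replaced by activations extracted from each layer of the pre-trained model, aligned with frame-level phone labels. Comparing held-out accuracy across layers then gives a rough measure of how accessible phonetic information is at each depth.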
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
