Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Yonatan Belinkov; James Glass

arXiv:1709.04482·cs.CL·September 15, 2017·35 cites


TL;DR

This paper analyzes the speech representations learned by a deep end-to-end speech recognition model trained with a CTC loss. It takes frame-level features from different layers of the model and measures how well a classifier trained on them predicts phone labels.

Contribution

It provides a layer-by-layer analysis of the speech representations in an end-to-end model, examining how well the features from each layer support phone prediction and what that implies for interpreting the model.

Findings

01. Layer depth affects representation quality for phoneme prediction.

02. Deeper layers tend to encode more abstract speech features.

03. Model complexity influences the interpretability of learned representations.

Abstract

Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and…
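The probing setup described in the abstract (frame-level features from a pre-trained model, fed to a classifier trained to label frames with phones) can be illustrated with a small sketch. This is not the authors' code: the features here are synthetic, a linear softmax probe stands in for whatever classifier the paper uses, and every name (`synthetic_layer_features`, `train_probe`, `evaluate_layer`, `layer_a`, `layer_b`) is hypothetical. In a real experiment the feature vectors would come from the activations of a chosen layer of the trained ASR model.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score per phone class for one frame vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train_probe(features, labels, num_classes, epochs=20, lr=0.1, seed=0):
    """Fit a linear softmax probe with plain SGD on (frame, phone) pairs."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Cross-entropy gradient with respect to the score of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, biases


def frame_accuracy(weights, biases, features, labels):
    """Fraction of frames whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        scores = class_scores(weights, biases, x)
        if max(range(len(scores)), key=scores.__getitem__) == y:
            correct += 1
    return correct / len(labels)


def synthetic_layer_features(labels, num_classes, dim, noise, seed):
    """Hypothetical stand-in for one layer's activations: class mean plus noise."""
    rng = random.Random(seed)
    means = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_classes)]
    return [[m + rng.gauss(0.0, noise) for m in means[y]] for y in labels]


def evaluate_layer(features, labels, num_classes, train_fraction=0.8):
    """Train the probe on the first part of the frames, score it on the rest."""
    cut = int(len(features) * train_fraction)
    weights, biases = train_probe(features[:cut], labels[:cut], num_classes)
    return frame_accuracy(weights, biases, features[cut:], labels[cut:])


if __name__ == "__main__":
    phone_labels = [i % 3 for i in range(150)]  # toy labels, 3 made-up classes
    for name, noise in [("layer_a", 0.1), ("layer_b", 2.0)]:  # hypothetical layers
        feats = synthetic_layer_features(phone_labels, 3, 8, noise, seed=1)
        print(name, round(evaluate_layer(feats, phone_labels, 3), 3))
```

Running the same probe on features from each layer and comparing held-out frame accuracy is the basic idea: a layer whose features give higher probe accuracy makes phone identity easier to read off. The noise levels above are arbitrary and say nothing about which layers of the actual model score higher.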


Code & Models

Repositories

boknilev/asr-repr-analysis
torch · Official


Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling