What does a network layer hear? Analyzing hidden representations of end-to-end ASR through speech synthesis

Chung-Yi Li; Pei-Chieh Yuan; Hung-Yi Lee

arXiv:1911.01102·cs.CL·November 5, 2019·1 cites

TL;DR

This paper introduces a novel method to analyze what information end-to-end speech recognition models retain at each layer by synthesizing speech from hidden representations, revealing how speaker variability and noise are progressively removed.

Contribution

The study presents the first approach to analyze end-to-end ASR layers through speech synthesis, providing insights into the internal representations of the model.

Findings

01 Gradual removal of speaker variability and noise in deeper layers

02 Synthesized speech confirms layer-wise information processing

03 Speaker verification and speech enhancement validate observations

Abstract

End-to-end speech recognition systems have achieved competitive results compared to traditional systems. However, the complex transformations between layers, applied to highly variable acoustic signals, are hard to analyze. In this paper, we present our ASR probing model, which synthesizes speech from the hidden representations of end-to-end ASR to examine the information maintained after each layer's computation. Listening to the synthesized speech, we observe gradual removal of speaker variability and noise as the layers go deeper, which aligns with previous studies on how deep networks function in speech recognition. This paper is the first study to analyze an end-to-end speech recognition model by demonstrating what each layer hears. Speaker verification and speech enhancement measurements on the synthesized speech are also conducted to further confirm our observations.
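The probing idea in the abstract can be illustrated with a toy sketch. This is not the authors' probing model, which is a neural network that synthesizes audible speech. It swaps in a linear least-squares probe on synthetic data, and every name here (`fit_linear_probe`, `probe_error`, the `content` and `nuisance` factors) is hypothetical. It only shows the underlying logic: if a layer has discarded some information, a probe trained to reconstruct the input from that layer cannot recover it.

```python
import numpy as np

def fit_linear_probe(hidden, target):
    """Fit a least-squares map from hidden frames (T, D) to target frames (T, F)."""
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, target, rcond=None)
    return W

def probe_error(hidden, target, W):
    """Mean squared reconstruction error of the probe on the given frames."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    return float(np.mean((X @ W - target) ** 2))

rng = np.random.default_rng(0)
T, F = 200, 16                       # frames, spectrogram bins (toy sizes)

# Assumption: each synthetic "spectrogram" frame mixes two independent factors.
content = rng.normal(size=(T, 4))    # stand-in for linguistic content
nuisance = rng.normal(size=(T, 4))   # stand-in for speaker and noise variation
mixing = rng.normal(size=(8, F))
spectrogram = np.hstack([content, nuisance]) @ mixing

# Hypothetical layer outputs: the shallow one keeps both factors,
# the deep one keeps only the content factor.
layers = {
    "shallow": np.hstack([content, nuisance]),
    "deep": content,
}

# Errors are measured on the training frames, which is enough for this toy case.
errors = {}
for name, hidden in layers.items():
    W = fit_linear_probe(hidden, spectrogram)
    errors[name] = probe_error(hidden, spectrogram, W)

print(errors)
```

In this toy setup the shallow layer reconstructs the target almost exactly, while the deep layer leaves a residual equal to the discarded nuisance component. The paper makes the analogous comparison by listening to the synthesized speech and by running speaker verification and speech enhancement measurements on it.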

Code & Models

Repositories

YuanPJ/Voice-in-ASR

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing