What does a network layer hear? Analyzing hidden representations of end-to-end ASR through speech synthesis
Chung-Yi Li, Pei-Chieh Yuan, Hung-Yi Lee

TL;DR
This paper introduces a novel method to analyze what information end-to-end speech recognition models retain at each layer by synthesizing speech from hidden representations, revealing how speaker variability and noise are progressively removed.
Contribution
The study presents the first approach to analyze end-to-end ASR layers through speech synthesis, providing insights into the internal representations of the model.
Findings
Gradual removal of speaker variability and noise in deeper layers
Synthesized speech confirms layer-wise information processing
Speaker verification and speech enhancement validate observations
Abstract
End-to-end speech recognition systems have achieved competitive results compared to traditional systems. However, the complex transformations between layers, applied to highly variable acoustic signals, are hard to analyze. In this paper, we present our ASR probing model, which synthesizes speech from hidden representations of end-to-end ASR to examine the information maintained after each layer's computation. Listening to the synthesized speech, we observe gradual removal of speaker variability and noise as the layers go deeper, which aligns with previous studies on how deep networks function in speech recognition. This paper is the first study analyzing the end-to-end speech recognition model by demonstrating what each layer hears. Speaker verification and speech enhancement measurements on synthesized speech are also conducted to further confirm our observation.
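The probing idea in the abstract can be illustrated with a toy sketch. This is not the authors' probing model: it assumes random data in place of speech features, a hypothetical frozen encoder made of random ReLU layers, and a plain least-squares probe in place of a speech synthesizer. It only shows the general pattern of fitting one probe per layer and comparing how well each layer's hidden representation reconstructs the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: random frames stand in for log-mel features (500 frames x 80 bins).
frames = rng.standard_normal((500, 80))

# Assumption: a frozen "encoder" of three random ReLU layers with shrinking width.
layer_widths = [64, 32, 8]


def encoder_hidden_states(x, widths, rng):
    """Return one hidden representation per layer of the toy encoder."""
    states = []
    h = x
    for width in widths:
        w = rng.standard_normal((h.shape[1], width)) / np.sqrt(h.shape[1])
        h = np.maximum(h @ w, 0.0)  # ReLU
        states.append(h)
    return states


def add_bias(hidden):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([hidden, np.ones((hidden.shape[0], 1))])


def fit_linear_probe(hidden, target):
    """Least-squares map from hidden states back to the input frames."""
    weights, *_ = np.linalg.lstsq(add_bias(hidden), target, rcond=None)
    return weights


hidden_states = encoder_hidden_states(frames, layer_widths, rng)

probe_errors = []
for depth, hidden in enumerate(hidden_states, start=1):
    weights = fit_linear_probe(hidden, frames)
    reconstruction = add_bias(hidden) @ weights
    mse = float(np.mean((reconstruction - frames) ** 2))
    probe_errors.append(mse)
    print(f"layer {depth}: reconstruction MSE = {mse:.3f}")
```

In an actual study the reconstruction would be converted to audio and listened to or scored, as the abstract describes; here the per-layer reconstruction error is only a rough proxy for how much input detail a layer retains.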
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
