Student-Teacher Learning for BLSTM Mask-based Speech Enhancement
Aswin Shanmugam Subramanian, Szu-Jui Chen, Shinji Watanabe

TL;DR
This paper introduces a student-teacher learning approach for single-channel speech enhancement, using multichannel beamformed signals as teacher targets to improve speech recognition accuracy.
Contribution
It proposes a novel student-teacher paradigm that leverages multichannel beamformed masks to enhance single-channel speech signals for better ASR performance.
Findings
Improved speech recognition accuracy on CHiME-4 data
Effective mimicry of multichannel masks by single-channel network
Enhanced speech quality for ASR tasks
Abstract
Spectral mask estimation using bidirectional long short-term memory (BLSTM) neural networks has been widely used in various speech enhancement applications, and it has achieved great success when it is applied to multichannel enhancement techniques with a mask-based beamformer. However, when these masks are used for single channel speech enhancement they severely distort the speech signal and make them unsuitable for speech recognition. This paper proposes a student-teacher learning paradigm for single channel speech enhancement. The beamformed signal from multichannel enhancement is given as input to the teacher network to obtain soft masks. An additional cross-entropy loss term with the soft mask target is combined with the original loss, so that the student network with single-channel input is trained to mimic the soft mask obtained with multichannel input through beamforming.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Advanced Adaptive Filtering Techniques · Indoor and Outdoor Localization Technologies
