Speaker Adapted Beamforming for Multi-Channel Automatic Speech Recognition
Tobias Menne, Ralf Schlüter, Hermann Ney

TL;DR
This paper introduces a speaker adaptation method for multi-channel ASR that fine-tunes mask estimation within a beamforming framework, enhancing recognition accuracy for specific speakers using a two-pass approach.
Contribution
It proposes a novel integration of mask-based beamforming with acoustic model training, enabling speaker-specific adaptation through retraining the mask estimation network.
Findings
Improved ASR performance on CHiME-4 data.
Effective speaker-specific beamforming adaptation.
Analysis of mask estimation changes due to adaptation.
Abstract
This paper presents, in the context of multi-channel ASR, a method to adapt a mask-based, statistically optimal beamforming approach to a speaker of interest. The beamforming vector of the statistically optimal beamformer is computed from speech and noise masks, which are estimated by a neural network. The proposed adaptation approach is based on the integration of the beamformer, which includes the mask estimation network, and the acoustic model of the ASR system. This allows the training error of the acoustic modeling cost function to be propagated all the way through the beamforming operation and through the mask estimation network. By using the results of a first-pass recognition and by keeping all other parameters fixed, the mask estimation network can therefore be fine-tuned by retraining. Utterances of a speaker of interest can thus be used in a two-pass…
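To make the mask-to-beamformer step in the abstract concrete, here is a minimal NumPy sketch. It rests on assumptions that are not taken from this page: the "statistically optimal beamformer" is illustrated with a generalized eigenvalue (GEV) beamformer, the masks are supplied as plain arrays rather than produced by a network, and the function names (`psd_matrices`, `gev_beamformer`, `apply_beamformer`) and the diagonal-loading constant `eps` are hypothetical. The speaker adaptation step itself, which backpropagates the acoustic model loss through this computation, is not shown.

```python
import numpy as np


def psd_matrices(Y, mask):
    """Mask-weighted spatial covariance matrix for each frequency bin.

    Y:    complex STFT of shape (F, T, M) = (frequency, frame, microphone).
    mask: real-valued mask of shape (F, T) with entries in [0, 1].
    Returns an array of shape (F, M, M).
    """
    # Sum over frames of mask[f, t] * y[f, t] y[f, t]^H
    num = np.einsum("ft,ftm,ftn->fmn", mask, Y, Y.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return num / norm[:, None, None]


def gev_beamformer(phi_speech, phi_noise, eps=1e-6):
    """Assumed GEV beamformer: per frequency, the vector w that maximizes
    (w^H phi_speech w) / (w^H phi_noise w).

    eps is a hypothetical diagonal-loading factor for numerical stability.
    Returns an array of shape (F, M).
    """
    F, M, _ = phi_speech.shape
    w = np.empty((F, M), dtype=complex)
    for f in range(F):
        load = eps * np.trace(phi_noise[f]).real / M
        phi_n = phi_noise[f] + load * np.eye(M)
        # Whiten with the Cholesky factor of the noise covariance so the
        # generalized problem becomes an ordinary Hermitian eigenproblem.
        L_inv = np.linalg.inv(np.linalg.cholesky(phi_n))
        A = L_inv @ phi_speech[f] @ L_inv.conj().T
        A = 0.5 * (A + A.conj().T)  # enforce exact Hermitian symmetry
        _, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
        w[f] = L_inv.conj().T @ vecs[:, -1]
    return w


def apply_beamformer(w, Y):
    """Beamformed single-channel STFT: out[f, t] = w[f]^H y[f, t]."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In a pipeline of the kind the abstract describes, the speech and noise masks would come from the mask estimation network, and the output of `apply_beamformer` would feed the feature extraction of the acoustic model. The GEV solution is only defined up to a complex scale per frequency, so practical systems usually add a normalization step, which this sketch leaves out.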
