Is the Ideal Ratio Mask Really the Best? -- Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers
Atsuo Hiroe (1), Katsutoshi Itoyama (1, 2), Kazuhiro Nakadai (2)
(1) Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan; (2) Honda Research Institute Japan Co., Ltd., Saitama, Japan

TL;DR
This paper compares four mask-based beamformers to determine which achieves the best speech extraction. It finds that the optimal mask depends on the beamformer and is not always the ideal ratio mask (IRM), contrary to a common assumption.
Contribution
The paper systematically investigates the optimal mask for each of four mask-based beamformers and shows that the IRM is not always the best choice, which can inform the design of mask-based beamformers.
Findings
All beamformers tested reach the performance upper bound set by the ideal MWF.
The optimal mask differs across beamformers and is not always the IRM.
Conventional assumptions about the IRM being universally optimal are challenged.
Abstract
This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with…
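The idea of measuring a beamformer by the mean square error between its output and the target speech can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the synthetic signals, the helper names (`apply_bf`, `lstsq_filter`), the magnitude-ratio definition of the IRM, and the simplified mask-based filter (which treats the masked reference-microphone signal as the target estimate) are all assumptions made for illustration. The ideal filter here is the per-frequency least-squares solution computed with the true target, which plays the role of the upper bound.

```python
import numpy as np

def apply_bf(W, X):
    """Apply per-frequency filters W (F, M) to X (F, T, M): y = w^H x."""
    return np.einsum('fm,ftm->ft', W.conj(), X)

def lstsq_filter(X, D):
    """Per-frequency filter minimizing sum_t |D - w^H x|^2 (normal equations)."""
    Phi_xx = np.einsum('ftm,ftn->fmn', X, X.conj())  # sum_t x x^H
    phi_xd = np.einsum('ftm,ft->fm', X, D.conj())    # sum_t x d^*
    return np.linalg.solve(Phi_xx, phi_xd[..., None])[..., 0]

# Synthetic stand-in for STFT data (assumption: F bins, T frames, M microphones).
rng = np.random.default_rng(0)
F, T, M = 4, 200, 3

def crandn(*shape):
    """Complex Gaussian samples of the given shape."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

steer = crandn(F, 1, M)        # hypothetical steering vectors
src = crandn(F, T, 1)          # hypothetical source spectrogram
S_img = steer * src            # target image at each microphone
N = 0.5 * crandn(F, T, M)      # additive noise
X = S_img + N                  # observed mixture

ref = 0                        # reference microphone index
S = S_img[:, :, ref]           # target speech at the reference microphone

# IRM under one common magnitude-ratio definition (an assumption here).
irm = np.abs(S) / (np.abs(S) + np.abs(N[:, :, ref]))

# Ideal filter: uses the true target. Mask-based filter: uses mask * x_ref.
W_ideal = lstsq_filter(X, S)
W_irm = lstsq_filter(X, irm * X[:, :, ref])

def mse(Y):
    """Mean square error between a beamformer output and the target."""
    return np.mean(np.abs(Y - S) ** 2)

mse_ideal = mse(apply_bf(W_ideal, X))
mse_irm = mse(apply_bf(W_irm, X))
print(f"ideal filter MSE: {mse_ideal:.4f}, IRM-based filter MSE: {mse_irm:.4f}")
```

Because the ideal filter is the least-squares optimum for each frequency bin, its error cannot exceed that of any filter derived from a mask. Searching for the mask that brings a given beamformer closest to this bound is the kind of per-utterance optimization the abstract describes.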
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
