Is the Ideal Ratio Mask Really the Best? -- Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers

Atsuo Hiroe (1), Katsutoshi Itoyama (1, 2), Kazuhiro Nakadai (2) ((1) Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan; (2) Honda Research Institute Japan Co., Ltd., Saitama, Japan)

arXiv:2309.12065·eess.AS·September 22, 2023

TL;DR

This paper compares four mask-based beamformers to determine which achieves the best speech extraction. The optimal mask varies by beamformer and is not always the ideal ratio mask (IRM), which challenges the assumption that the IRM is universally optimal.

Contribution

The paper systematically investigates the optimal masks for four mask-based beamformers and shows that the IRM is not always the best choice, a result that can inform the design of mask-based beamformers.

Findings

1. All four beamformers tested reach the performance upper bound set by the ideal multichannel Wiener filter (MWF).

2. The optimal mask differs across beamformers and is not always the IRM.

3. These results challenge the conventional assumption that the IRM is universally optimal.

Abstract

This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with…
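The idea of a mask-based MWF and its "ideal" upper bound can be illustrated with a minimal sketch for a single frequency bin. This is not the authors' implementation: the IRM definition (`|S| / (|S| + |N|)`), the choice of the masked reference channel as the filter's desired signal, the function names, and the synthetic data are all assumptions made for illustration. The sketch fits one least-squares filter toward the true target (the ideal case) and one toward the IRM-masked observation, then compares their mean square errors.

```python
import numpy as np

def ideal_ratio_mask(target_mag, interference_mag, eps=1e-12):
    """One common IRM definition (assumed here): |S| / (|S| + |N|), in [0, 1]."""
    return target_mag / (target_mag + interference_mag + eps)

def wiener_filter(obs, desired):
    """Filter w minimizing sum_t |desired[t] - w^H obs[:, t]|^2.

    obs: (channels, frames) complex array; desired: (frames,) complex array.
    """
    cov = obs @ obs.conj().T        # sum_t x[t] x[t]^H
    cross = obs @ desired.conj()    # sum_t x[t] conj(d[t])
    return np.linalg.solve(cov, cross)

def apply_filter(w, obs):
    """Beamformer output y[t] = w^H x[t]."""
    return w.conj() @ obs

def mse(estimate, target):
    return float(np.mean(np.abs(estimate - target) ** 2))

# Synthetic single-bin data; every value below is an illustrative assumption.
rng = np.random.default_rng(0)
channels, frames = 4, 200
steering = rng.standard_normal(channels) + 1j * rng.standard_normal(channels)
source = rng.standard_normal(frames) + 1j * rng.standard_normal(frames)
noise = 0.5 * (rng.standard_normal((channels, frames))
               + 1j * rng.standard_normal((channels, frames)))
target_image = np.outer(steering, source)   # target as seen at each microphone
obs = target_image + noise
target_ref = target_image[0]                # target at the reference microphone

# Oracle mask computed from the true target and noise at the reference channel.
irm = ideal_ratio_mask(np.abs(target_ref), np.abs(noise[0]))

# Ideal MWF: the desired signal is the true target itself.
w_ideal = wiener_filter(obs, target_ref)
# Mask-based MWF (assumed form): the desired signal is the masked observation.
w_irm = wiener_filter(obs, irm * obs[0])

mse_ideal = mse(apply_filter(w_ideal, obs), target_ref)
mse_irm = mse(apply_filter(w_irm, obs), target_ref)
mse_unprocessed = mse(obs[0], target_ref)
print(f"ideal MWF: {mse_ideal:.4f}, IRM-based MWF: {mse_irm:.4f}, "
      f"unprocessed: {mse_unprocessed:.4f}")
```

By construction, the ideal filter attains the lowest MSE of any linear filter on this data, so `mse_ideal` cannot exceed `mse_irm`. How close a masked variant gets to that bound depends on the mask, which is the question the paper studies.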

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques