Consistency-aware multi-channel speech enhancement using deep neural networks
Yoshiki Masuyama, Masahito Togami, Tatsuya Komatsu

TL;DR
This paper trains a DNN-based multi-channel speech enhancement system with an objective defined on the reconstructed time-domain signal, so that training accounts for the inconsistency of the estimated spectrogram.
Contribution
It proposes a novel objective function based on reconstructed time-domain signals for training DNN-based multi-channel speech enhancement systems.
Findings
Improved speech quality over traditional T-F masking methods
Effective enhancement with reconstructed time-domain objective functions
Demonstrated superiority in experimental comparisons
Abstract
This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented in the T-F domain. In such a case, ordinary objective functions are computed on the estimated T-F mask or spectrogram. However, the estimated spectrogram is often inconsistent, and its amplitude and phase may change when the spectrogram is converted back to the time-domain. That is, the objective function does not evaluate the enhanced time-domain signal properly. To address this problem, we propose to use an objective function defined on the reconstructed time-domain signal. Specifically, speech enhancement is conducted by multi-channel Wiener filtering in…
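The inconsistency the abstract describes can be illustrated with a small single-channel sketch. It is not the paper's multi-channel Wiener filtering pipeline: the random mask stands in for a DNN output, white noise stands in for speech, and the names `to_spec`, `to_time`, `inconsistency`, and `time_domain_loss`, the sampling rate, and the window length are all assumptions made for illustration. The sketch measures how much a masked spectrogram changes after an inverse-STFT and STFT round trip, then evaluates a mean squared error on the reconstructed waveform rather than on the spectrogram.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 16000       # assumed sampling rate in Hz
NPERSEG = 512    # assumed window length; SciPy defaults to a Hann window, 50% overlap


def to_spec(x):
    """STFT of a 1-D signal; returns a complex array of shape (freq, frames)."""
    _, _, spec = stft(x, fs=FS, nperseg=NPERSEG)
    return spec


def to_time(spec):
    """Inverse STFT (weighted overlap-add) back to a 1-D signal."""
    _, x = istft(spec, fs=FS, nperseg=NPERSEG)
    return x


def inconsistency(spec):
    """Relative change of a spectrogram after an iSTFT -> STFT round trip."""
    reproj = to_spec(to_time(spec))
    frames = min(spec.shape[1], reproj.shape[1])
    diff = spec[:, :frames] - reproj[:, :frames]
    return np.linalg.norm(diff) / np.linalg.norm(spec[:, :frames])


def time_domain_loss(est_spec, target):
    """Mean squared error measured after reconstruction to the time domain."""
    est = to_time(est_spec)
    n = min(len(est), len(target))
    return np.mean((est[:n] - target[:n]) ** 2)


rng = np.random.default_rng(0)
clean = rng.standard_normal(16384)                 # stand-in for clean speech
noisy = clean + 0.5 * rng.standard_normal(16384)   # stand-in for a noisy observation

noisy_spec = to_spec(noisy)
mask = rng.uniform(0.0, 1.0, noisy_spec.shape)     # stand-in for a DNN-estimated mask
masked_spec = mask * noisy_spec

consistent_gap = inconsistency(noisy_spec)   # STFT of a real signal: round trip is near-exact
masked_gap = inconsistency(masked_spec)      # masked spectrogram: round trip alters it
loss = time_domain_loss(masked_spec, clean)  # objective evaluated on the waveform
```

In this sketch the spectrogram computed directly from a waveform survives the round trip essentially unchanged, while the masked one does not, so a loss computed on `masked_spec` scores a spectrogram that differs from the one the listener actually hears. Evaluating the loss after `to_time` avoids that mismatch, which is the motivation the abstract gives for a time-domain objective.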
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation
