TL;DR
This paper presents a dual-process online speech enhancement method for AR headsets that combines DNN-based beamforming with FastMNMF-guided adaptation, significantly improving speech recognition in noisy environments.
Contribution
It introduces a novel dual-process online speech enhancement approach that adaptively combines deep neural network beamforming with FastMNMF for real-time AR applications.
Findings
Word error rate improved by over 10 points with 12 minutes of adaptation.
The method handles real noisy, reverberant environments.
AR transcription accuracy is enhanced through spatial and temporal processing.
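For readers unfamiliar with the metric behind the first finding, here is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative and do not come from the paper. Reading "10 points" as absolute percentage points (for example, a hypothetical drop from 35% to 25%) is an assumption about the summary's wording.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a hypothesis that substitutes one word in a four-word reference scores 0.25, that is, 25% WER.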
Abstract
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user understand conversations made in real noisy echoic environments (e.g., cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF) that works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing. In contrast, a supervised beamforming method that uses a deep neural network (DNN) for estimating spatial information of speech and noise readily fits real-time processing, but suffers from drastic performance degradation in mismatched conditions. Given such complementary characteristics, we propose a dual-process robust online speech enhancement…
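To make the supervised half of the abstract concrete, the following is a minimal NumPy sketch of mask-based beamforming, under stated assumptions. It assumes that the "spatial information of speech and noise" is obtained from time-frequency masks (here plain arrays standing in for DNN outputs) and that the filter is an MVDR beamformer in the reference-microphone (Souden) form. Neither choice is confirmed by the truncated abstract. The function names, array shapes, and the `diag_load` parameter are illustrative. The FastMNMF side and the way the two processes are coupled are not sketched, because the abstract is cut off before describing them.

```python
import numpy as np


def masked_scm(X, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrices.

    X:    (F, T, M) complex multichannel STFT (frequency, frame, microphone)
    mask: (F, T) real weights in [0, 1]
    Returns an array of shape (F, M, M).
    """
    num = np.einsum("ft,ftm,ftn->fmn", mask, X, X.conj())
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den


def mvdr_weights(scm_speech, scm_noise, ref_mic=0, diag_load=1e-6):
    """MVDR filter per frequency bin: w = (Phi_n^-1 Phi_s) u / tr(Phi_n^-1 Phi_s)."""
    n_mics = scm_noise.shape[-1]
    # Diagonal loading scaled by the average noise power keeps Phi_n invertible.
    power = np.trace(scm_noise, axis1=1, axis2=2).real / n_mics
    loaded = scm_noise + diag_load * power[:, None, None] * np.eye(n_mics)
    numer = np.linalg.solve(loaded, scm_speech)          # (F, M, M)
    trace = np.trace(numer, axis1=1, axis2=2)[:, None]   # (F, 1)
    return numer[:, :, ref_mic] / (trace + 1e-12)        # (F, M)


def apply_beamformer(W, X):
    """Y[f, t] = w_f^H x_{f,t}; returns an (F, T) single-channel STFT."""
    return np.einsum("fm,ftm->ft", W.conj(), X)
```

In a streaming setting, the covariance matrices would typically be updated recursively frame by frame rather than averaged over the whole signal. That detail is left out to keep the sketch short.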
