Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments

Yicheng Du; Aditya Arie Nugraha; Kouhei Sekiguchi; Yoshiaki Bando; Mathieu Fontaine; Kazuyoshi Yoshii

arXiv:2207.07273·eess.AS·July 18, 2022

Open Access

TL;DR

This paper introduces a direction-aware joint adaptation method for neural speech enhancement and recognition in real multiparty conversational environments, aiming to improve recognition accuracy on an augmented reality headset despite training-test mismatch and the user's head movements.

Contribution

It proposes a semi-supervised joint adaptation technique that updates both the speech mask estimator and the ASR model at run-time, addressing the training-test mismatch and the user's head movements.

Findings

01 Significant ASR performance improvements demonstrated

02 Effective adaptation in real multiparty scenarios

03 Enhanced robustness to head movements and environmental noise

Abstract

This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments. A major approach that has actively been studied in simulated environments is to sequentially perform speech enhancement and automatic speech recognition (ASR) based on deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work due to the mismatch between the training and test conditions and the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN-based speech mask estimator that can adaptively extract the speech components corresponding to a head-relative particular direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean…
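
The mask-based beamforming step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not name the beamformer, so an MVDR beamformer in the reference-microphone formulation is assumed here, and the function names (`spatial_covariance`, `mvdr_weights`, `beamform`) and array layouts are hypothetical. The speech mask is taken as given; in the paper it would come from the direction-aware DNN mask estimator, which is not modeled.

```python
import numpy as np

def spatial_covariance(X, mask):
    """Mask-weighted spatial covariance matrix per frequency bin.

    X: complex STFT, shape (F, T, M) = (freq bins, frames, mics).
    mask: values in [0, 1], shape (F, T).
    Returns an array of shape (F, M, M).
    """
    weighted = X * mask[..., None]
    cov = np.einsum("ftm,ftn->fmn", weighted, X.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-8)
    return cov / norm[:, None, None]

def mvdr_weights(cov_speech, cov_noise, ref_mic=0, eps=1e-8):
    """Assumed MVDR form: w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s)."""
    n_mics = cov_noise.shape[-1]
    # Small diagonal loading keeps the noise covariance invertible.
    cov_noise = cov_noise + eps * np.eye(n_mics)
    numerator = np.linalg.solve(cov_noise, cov_speech)  # (F, M, M)
    trace = np.trace(numerator, axis1=-2, axis2=-1)      # (F,)
    # Selecting column ref_mic is the product with the one-hot vector u.
    return numerator[..., ref_mic] / (trace[:, None] + eps)

def beamform(X, speech_mask, ref_mic=0):
    """Apply mask-based MVDR beamforming; returns an (F, T) STFT."""
    cov_s = spatial_covariance(X, speech_mask)
    cov_n = spatial_covariance(X, 1.0 - speech_mask)
    w = mvdr_weights(cov_s, cov_n, ref_mic=ref_mic)
    return np.einsum("fm,ftm->ft", w.conj(), X)
```

In this sketch the only learned component is the mask, so adapting the mask estimator at run-time changes the spatial covariance estimates and therefore the beamformer, without any explicit steering vector.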

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis

Methods: Test