Far-Field Automatic Speech Recognition

Reinhold Haeb-Umbach (1), Jahn Heymann (2), Lukas Drude, Shinji Watanabe (3), Marc Delcroix (4), Tomohiro Nakatani (4) ((1) Paderborn University, Germany; (2) Amazon Aachen, Germany; (3) Johns Hopkins University, Baltimore, USA; (4) NTT Communication Science Laboratories, Kyoto, Japan)

arXiv:2009.09395 · eess.AS · September 22, 2020 · Proc. IEEE

Open Access

TL;DR

This paper reviews advances in far-field automatic speech recognition, highlighting signal enhancement, robust training, and end-to-end architectures that improve recognition accuracy in distant speech scenarios.

Contribution

It provides a comprehensive overview of algorithms and techniques, including deep learning and traditional signal processing, for effective far-field speech recognition.

Findings

1. Deep learning significantly advances far-field ASR.

2. Combining traditional signal processing with deep learning yields effective solutions.

3. End-to-end architectures are promising for distant speech recognition.

Abstract

The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase of attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile it has entered the consumer market with digital home assistants with a spoken language interface being its most prominent application. Speech recorded at a distance is affected by various acoustic distortions and, consequently, quite different processing pipelines have emerged compared to ASR for close-talk speech. A signal enhancement front-end for dereverberation, source separation and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multi-condition training and adaptation. We will also describe the so-called end-to-end approach to…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing