Leveraging Redundancy in Multiple Audio Signals for Far-Field Speech Recognition

Feng-Ju Chang; Anastasios Alexandridis; Rupak Vignesh Swaminathan; Martin Radfar; Harish Mallidi; Maurizio Omologo; Athanasios Mouchtaris; Brian King; Roland Maas

arXiv:2303.00692 · eess.AS · March 2, 2023 · 1 citation

TL;DR

This paper introduces fusion networks that exploit the redundancy between post-AEC microphone signals and the AFE output to improve far-field speech recognition, reporting up to 25.9% relative word error rate (WER) reduction.

Contribution

It proposes novel fusion networks combining post-AEC and AFE signals, enhancing robustness in far-field ASR beyond traditional single-signal approaches.

Findings

01. Up to 25.9% relative WER reduction with fusion networks.

02. Fusion networks outperform single-signal models.

03. Minimal parameter increase (~2%) for significant accuracy gains.
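
For readers unfamiliar with the metric, a relative WER reduction compares the drop in word error rate against the baseline rate rather than reporting the absolute difference. The sketch below shows the arithmetic only. The function name and the two WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (for example percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values chosen for illustration, NOT results from the paper.
baseline_wer = 10.0  # single-signal model
fused_wer = 8.0      # model with a fusion network
reduction = relative_wer_reduction(baseline_wer, fused_wer)
print(f"{reduction:.1f}% relative WER reduction")
```

With these made-up numbers, an absolute drop of 2 points on a baseline of 10 corresponds to a 20% relative reduction.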

Abstract

To achieve robust far-field automatic speech recognition (ASR), existing techniques typically employ an acoustic front end (AFE) cascaded with a neural transducer (NT) ASR model. The AFE output, however, can be unreliable when the beamformer in the AFE is steered in the wrong direction. A promising way to address this issue is to exploit the microphone signals taken before the beamforming stage and after acoustic echo cancellation (post-AEC) in the AFE. We argue that the post-AEC and AFE outputs are complementary, and that the redundancy between these signals can be leveraged to compensate for potential AFE processing errors. We present two fusion networks to explore this redundancy and aggregate these multi-channel (MC) signals: (1) a Frequency-LSTM based network and (2) a Convolutional Neural Network based network. We augment the MC fusion networks to a conformer transducer model and…
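
The general idea of convolutional multi-channel fusion can be sketched as follows. This is a minimal illustration and not the architecture from the paper: the function name `conv2d_fuse`, the time-by-frequency feature layout, the single convolution layer with ReLU, and the fixed kernel weights are all assumptions made here. In a trained model the kernels would be learned.

```python
import random


def conv2d_fuse(channels, kernels, bias=0.0):
    """Fuse C feature maps of shape (T, F) into one (T, F) map.

    channels: list of C maps, each a list of T rows of F floats.
    kernels:  list of C kernels with odd height and width, one per channel.
    Zero padding keeps the output the same size as the input.
    """
    num_t, num_f = len(channels[0]), len(channels[0][0])
    k_h, k_w = len(kernels[0]), len(kernels[0][0])
    pad_t, pad_f = k_h // 2, k_w // 2
    fused = [[bias] * num_f for _ in range(num_t)]
    for chan, kern in zip(channels, kernels):
        for t in range(num_t):
            for f in range(num_f):
                acc = 0.0
                for i in range(k_h):
                    for j in range(k_w):
                        tt, ff = t + i - pad_t, f + j - pad_f
                        if 0 <= tt < num_t and 0 <= ff < num_f:
                            acc += kern[i][j] * chan[tt][ff]
                fused[t][f] += acc  # sum contributions across channels
    # ReLU non-linearity on the fused map
    return [[max(0.0, v) for v in row] for row in fused]


random.seed(0)
T, F = 4, 5  # hypothetical number of frames and frequency bins


def random_map():
    return [[random.random() for _ in range(F)] for _ in range(T)]


# Hypothetical inputs: two post-AEC microphone feature maps and one AFE output.
post_aec_1, post_aec_2, afe_out = random_map(), random_map(), random_map()
inputs = [post_aec_1, post_aec_2, afe_out]

# Placeholder 3x3 kernels with only the centre tap set, so the layer reduces
# to a per-bin average of the three signals. Learned kernels would differ.
centre_only = [[0.0, 0.0, 0.0], [0.0, 1.0 / 3.0, 0.0], [0.0, 0.0, 0.0]]
fused = conv2d_fuse(inputs, [centre_only] * 3)
print(len(fused), len(fused[0]))
```

The fused map has the same time and frequency dimensions as each input, so under these assumptions it could be passed to a downstream recognizer in place of a single-signal feature map.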

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Acoustic Wave Phenomena Research