Do You Listen with One or Two Microphones? A Unified ASR Model for Single and Multi-Channel Audio

Gokce Keskin; Minhua Wu; Brian King; Harish Mallidi; Yang Gao; Jasha Droppo; Ariya Rastrow; Roland Maas

arXiv:2106.02750·eess.AS·June 30, 2021

Open Access

TL;DR

This paper introduces a unified ASR model capable of processing both single and multi-channel audio, improving accuracy and flexibility in real-world scenarios with variable auxiliary data availability.

Contribution

The authors propose a novel unified ASR architecture and training methodology that effectively handles primary-only and primary-plus-auxiliary audio inputs.

Findings

01 Up to 12.5% relative word error rate reduction (WERR) over the primary-only baseline

02 Up to 16.0% relative WERR in low-SNR conditions

03 Up to 2.5% relative WERR over the primary-plus-auxiliary baseline
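
For readers unfamiliar with the metric, the sketch below shows how a relative WERR figure is usually computed, assuming the conventional definition (the reduction in word error rate as a fraction of the baseline's word error rate). The function name and the WER values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Return the relative word error rate reduction, in percent.

    Assumes the conventional definition:
        WERR = 100 * (WER_baseline - WER_new) / WER_baseline
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT reported in the paper.
example = relative_werr(baseline_wer=10.0, new_wer=8.75)
print(f"relative WERR: {example:.1f}%")
```

Under this definition a positive value means the new model makes fewer word errors than the baseline, and the figure does not depend on the absolute WER scale.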

Abstract

Automatic speech recognition (ASR) models are typically designed to operate on a single input data type, e.g., single- or multi-channel audio streamed from a device. This design decision assumes that the primary input data source does not change and that, if an additional (auxiliary) data source is occasionally available, it cannot be used. An ASR model that operates on both primary and auxiliary data can achieve better accuracy than a primary-only solution, and a model that can serve both primary-only (PO) and primary-plus-auxiliary (PPA) modes is highly desirable. In this work, we propose a unified ASR model that can serve both modes. We demonstrate its efficacy in a realistic scenario where a set of devices typically stream a single primary audio channel, and two additional auxiliary channels only when upload bandwidth allows it. The architecture enables a unique methodology that uses…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing