Multichannel End-to-end Speech Recognition

Tsubasa Ochiai; Shinji Watanabe; Takaaki Hori; John R. Hershey

arXiv:1703.04783 · cs.SD · March 16, 2017 · 46 cites

Open Access

TL;DR

This paper introduces a multichannel end-to-end speech recognition system that integrates microphone array processing into the neural network architecture, improving recognition accuracy in noisy environments.

Contribution

The paper extends end-to-end speech recognition so that the beamforming and recognition components are optimized jointly within a single neural network.

Findings

01. Outperforms the attention-based baseline on the noisy speech benchmarks CHiME-4 and AMI

02. Joint optimization of beamforming and recognition improves noise robustness

03. Beamforming is integrated effectively into the end-to-end architecture

Abstract

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive…
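To make the idea of beamforming inside the acoustic encoding network more concrete, here is a minimal NumPy sketch of one common formulation: mask-based MVDR beamforming applied as a per-frequency filter-and-sum over the multichannel STFT. The abstract above does not say which beamformer is used, so the choice of MVDR, the function names (`spatial_covariance`, `mvdr_weights`, `filter_and_sum`), and the array shapes are all assumptions made for illustration. Every step is built from differentiable linear algebra, which is what would let a recognition loss train the parts that produce the masks. The sketch shows only the forward computation, with masks supplied by the caller instead of predicted by a network.

```python
import numpy as np


def spatial_covariance(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    stft: complex array, shape (frames, freqs, channels)
    mask: real array in [0, 1], shape (frames, freqs)
    returns: complex array, shape (freqs, channels, channels)
    """
    # Phi[f] = sum_t mask[t, f] * x[t, f] x[t, f]^H / sum_t mask[t, f]
    num = np.einsum("tf,tfc,tfd->fcd", mask, stft, stft.conj())
    return num / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(phi_speech, phi_noise, ref_channel=0, eps=1e-8):
    """MVDR filter per frequency bin (reference-channel formulation).

    w[f] = (Phi_N^{-1} Phi_S) u / trace(Phi_N^{-1} Phi_S),
    where u is a one-hot vector selecting the reference microphone.
    returns: complex array, shape (freqs, channels)
    """
    channels = phi_noise.shape[-1]
    # Diagonal loading keeps the noise covariance invertible.
    phi_noise = phi_noise + eps * np.eye(channels)
    numer = np.linalg.solve(phi_noise, phi_speech)  # Phi_N^{-1} Phi_S
    trace = np.trace(numer, axis1=-2, axis2=-1)[:, None]
    return numer[..., ref_channel] / (trace + eps)


def filter_and_sum(stft, weights):
    """Apply the filters: y[t, f] = sum_c conj(w[f, c]) * x[t, f, c]."""
    return np.einsum("fc,tfc->tf", weights.conj(), stft)
```

In an end-to-end system of the kind the abstract describes, the speech and noise masks would come from a trainable network, and the enhanced single-channel output of `filter_and_sum` would feed the attention-based encoder-decoder. A framework with automatic differentiation would be needed in place of NumPy to backpropagate through these steps.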

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing