Exploiting Single-Channel Speech For Multi-channel End-to-end Speech Recognition

Keyu An; Zhijian Ou

arXiv:2107.02670·eess.AS·July 7, 2021

TL;DR

This paper investigates how single-channel speech data can improve multi-channel end-to-end speech recognition. It proposes three methods (pre-training, data scheduling, and data simulation) and demonstrates their effectiveness on the CHiME4 and AISHELL-4 datasets.

Contribution

The paper introduces three novel schemes to incorporate single-channel data into multi-channel end-to-end speech recognition systems, improving training stability and recognition accuracy.

Findings

01. All three methods improve multi-channel end-to-end training stability and speech recognition performance on CHiME4 and AISHELL-4.

02. Data scheduling offers a simpler and less costly approach.

03. Performance depends on front-end choice, data augmentation, and data size.

Abstract

Recently, the end-to-end training approach for neural beamformer-supported multi-channel ASR has shown its effectiveness in multi-channel speech recognition. However, the integration of multiple modules makes it more difficult to perform end-to-end training, particularly given that the multi-channel speech corpus recorded in real environments with a sizeable data scale is relatively limited. This paper explores the usage of single-channel data to improve the multi-channel end-to-end speech recognition system. Specifically, we design three schemes to exploit the single-channel data, namely pre-training, data scheduling, and data simulation. Extensive experiments on CHiME4 and AISHELL-4 datasets demonstrate that all three methods improve the multi-channel end-to-end training stability and speech recognition performance, while the data scheduling approach keeps a much simpler pipeline (vs.…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Blind Source Separation Techniques