Exploring End-to-End Multi-channel ASR with Bias Information for Meeting Transcription
Xiaofei Wang, Naoyuki Kanda, Yashesh Gaur, Zhuo Chen, Zhong Meng, Takuya Yoshioka

TL;DR
This paper investigates joint multi-channel speech enhancement and recognition for meeting transcription, leveraging large-scale single-channel data and bias information to improve accuracy in practical scenarios.
Contribution
It introduces a novel joint modeling framework with effective training strategies and a location bias integration method, enhancing meeting transcription performance.
Findings
Significant word error rate (WER) reduction achieved on meeting recordings.
Effective use of simulated and real multi-channel data for training.
Proposed deep concatenation method improves location bias integration.
Abstract
Joint optimization of multi-channel front-end and automatic speech recognition (ASR) has attracted much interest. While promising results have been reported for various tasks, past studies on its meeting transcription application were limited to small scale experiments. It is still unclear whether such a joint framework can be beneficial for a more practical setup where a massive amount of single channel training data can be leveraged for building a strong ASR back-end. In this work, we present our investigation on the joint modeling of a mask-based beamformer and Attention-Encoder-Decoder-based ASR in the setting where we have 75k hours of single-channel data and a relatively small amount of real multi-channel data for model training. We explore effective training procedures, including a comparison of simulated and real multi-channel training data. To guide the recognition towards a…
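For readers unfamiliar with the front-end named in the abstract, the following is a minimal NumPy sketch of one common form of mask-based beamforming. It assumes an MVDR beamformer in the reference-channel formulation; the abstract does not say which beamformer variant the authors use, so this choice is an assumption. The function names (`masked_covariance`, `mvdr_weights`, `apply_beamformer`) and the diagonal-loading constant are hypothetical, chosen for illustration only. In a joint system the time-frequency masks would be predicted by a neural network trained together with the ASR back-end; here they are plain input arrays.

```python
import numpy as np


def masked_covariance(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix per frequency bin.

    stft: complex array of shape (channels, frames, freqs)
    mask: real array of shape (frames, freqs) with values in [0, 1]
    returns: complex array of shape (freqs, channels, channels)
    """
    weighted = stft * mask[None, :, :]
    # cov[f, c, d] = sum_t mask[t, f] * y_c[t, f] * conj(y_d[t, f])
    cov = np.einsum("ctf,dtf->fcd", weighted, stft.conj())
    return cov / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(cov_speech, cov_noise, ref_channel=0, diag_load=1e-6):
    """MVDR weights w(f) = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s).

    u is a one-hot vector that selects the reference channel.
    returns: complex array of shape (freqs, channels)
    """
    n_ch = cov_noise.shape[-1]
    # Diagonal loading keeps the noise covariance well conditioned.
    noise_power = np.trace(cov_noise, axis1=-2, axis2=-1).real / n_ch
    loading = (diag_load * noise_power + 1e-10)[:, None, None] * np.eye(n_ch)
    numerator = np.linalg.solve(cov_noise + loading, cov_speech)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref_channel] / (trace[:, None] + 1e-10)


def apply_beamformer(weights, stft):
    """Enhanced single-channel STFT: x[t, f] = w(f)^H y[t, f]."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

A typical call sequence would compute `masked_covariance(stft, speech_mask)` and `masked_covariance(stft, noise_mask)`, pass both to `mvdr_weights`, and feed the output of `apply_beamformer` to the recognizer. Because every step is differentiable, the same computation written in an autodiff framework lets the ASR loss propagate back into the mask estimator, which is the sense in which the front-end and back-end can be optimized jointly.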
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
