Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition
Felix Weninger, Marco Gaudesi, Ralf Leibold, Roberto Gemello, Puming Zhan

TL;DR
This paper introduces a dual-encoder architecture with an encoder selection network for joint close-talk and far-talk speech recognition, improving accuracy by leveraging both input sources and outperforming single-encoder systems.
Contribution
The paper presents a novel dual-encoder ASR system with encoder selection that combines close-talk (CT) and far-talk (FT) speech inputs for better recognition accuracy.
Findings
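Up to 9% relative word error rate (WER) reduction when both CT and FT inputs are used.
The single-channel CT encoder and the multi-channel FT encoder are trained jointly with the encoder selection network.
Validated on conversational speech from a medical use case, recorded simultaneously with a CT device and a microphone array, on both attention-based and RNN Transducer systems.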
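To make the "relative WER reduction" figure concrete, here is a minimal sketch of how such a number is usually computed. The function name and the example WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values (NOT from the paper): a baseline at 10.0% WER
# and a new system at 9.1% WER.
reduction = relative_wer_reduction(10.0, 9.1)
print(f"relative WER reduction: {reduction:.1%}")
```

A relative reduction is measured against the baseline error rate, so it is a different quantity from an absolute drop in WER percentage points.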
Abstract
In this paper, we propose a dual-encoder ASR architecture for joint modeling of close-talk (CT) and far-talk (FT) speech, in order to combine the advantages of CT and FT devices for better accuracy. The key idea is to add an encoder selection network to choose the optimal input source (CT or FT) and the corresponding encoder. We use a single-channel encoder for CT speech and a multi-channel encoder with Spatial Filtering neural beamforming for FT speech, which are jointly trained with the encoder selection. We validate our approach on both attention-based and RNN Transducer end-to-end ASR systems. The experiments are done with conversational speech from a medical use case, which is recorded simultaneously with a CT device and a microphone array. Our results show that the proposed dual-encoder architecture obtains up to 9% relative WER reduction when using both CT and FT input, compared…
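The encoder selection idea described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the selection logits are passed in directly instead of being produced by a trained network, the selection is assumed to be a hard choice between the two sources, and the helper names (`toy_ct_encoder`, `toy_ft_encoder`, `select_encoder`) are hypothetical. In particular, the channel averaging below is only a stand-in for the neural beamforming front end mentioned in the abstract.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frames = List[float]          # one feature value per frame (toy, single channel)
MultiChannel = List[Frames]   # one list of frames per microphone channel


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_ct_encoder(frames: Frames) -> Frames:
    """Placeholder single-channel encoder: passes the frames through."""
    return list(frames)


def toy_ft_encoder(channels: MultiChannel) -> Frames:
    """Placeholder multi-channel encoder: averages the channels per frame.

    Assumption: averaging stands in for a learned beamformer here.
    """
    n_channels = len(channels)
    n_frames = len(channels[0])
    return [sum(ch[t] for ch in channels) / n_channels for t in range(n_frames)]


def select_encoder(
    selection_logits: Sequence[float],
    ct_input: Frames,
    ft_input: MultiChannel,
    ct_encoder: Callable[[Frames], Frames] = toy_ct_encoder,
    ft_encoder: Callable[[MultiChannel], Frames] = toy_ft_encoder,
) -> Tuple[str, Frames]:
    """Pick the input source with the higher selection probability
    and run only the matching encoder (hard selection, assumed)."""
    p_ct, p_ft = softmax(selection_logits)
    if p_ct >= p_ft:
        return "CT", ct_encoder(ct_input)
    return "FT", ft_encoder(ft_input)


# Hypothetical logits favouring the far-talk source.
source, encoded = select_encoder(
    selection_logits=[0.2, 1.5],
    ct_input=[1.0, 2.0],
    ft_input=[[1.0, 3.0], [3.0, 5.0]],
)
print(source, encoded)
```

In a jointly trained system the selection logits would come from a network optimized together with both encoders; the sketch only shows the routing step that follows from that decision.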
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
