Speaker conditioned acoustic modeling for multi-speaker conversational ASR
Srikanth Raj Chetupalli, Sriram Ganapathy

TL;DR
This paper introduces a speaker conditioned acoustic model for multi-speaker conversational ASR that leverages speaker diarization and joint optimization, significantly reducing word error rates in overlapping speech scenarios.
Contribution
It presents a novel speaker conditioned acoustic model that integrates speaker activity inputs and joint training, improving multi-speaker ASR performance over existing methods.
Findings
Achieved 12% relative WER reduction over baseline
Effective integration of speaker diarization with ASR
Joint optimization enhances transcription accuracy
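The 12% figure above is a relative reduction, not an absolute drop in WER. As a minimal sketch of that arithmetic, the function below computes a relative WER reduction from two WER values. The function name and the example numbers (25.0 and 22.0) are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline.

    Both arguments are word error rates in the same unit (e.g. percent).
    A result of 0.12 means the system removes 12% of the baseline's errors.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline at 25.0% WER and a system at 22.0% WER.
reduction = relative_wer_reduction(25.0, 22.0)
print(f"relative WER reduction: {reduction:.0%}")
```

With these illustrative inputs, an absolute drop of 3 WER points corresponds to a 12% relative reduction.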
Abstract
In this paper, we propose a novel approach for the transcription of speech conversations with natural speaker overlap, from single channel speech recordings. The proposed model is a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system. The speaker conditioned acoustic model (SCAM) in the ASR system consists of a series of embedding layers which use the speaker activity inputs from the diarization system to derive speaker specific embeddings. The outputs of the SCAM are speaker specific senones that are used for decoding the transcripts for each speaker in the conversation. In this work, we experiment with the automatic speaker activity decisions generated using an end-to-end speaker diarization system. A joint learning approach is also proposed where the diarization model and the ASR acoustic model are jointly optimized. The experiments are…
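The data flow the abstract describes (acoustic features plus per-speaker activity in, per-speaker senone posteriors out) can be sketched roughly as follows. This is not the authors' architecture: the abstract does not give layer types, sizes, or how activity is injected, so the concatenation-based conditioning, the single embedding layer, the random weights, and every name here (`scam_forward`, `init_params`, `W_feat`, `W_emb`, `W_out`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(feat_dim: int, hidden_dim: int, num_senones: int, seed: int = 0) -> dict:
    """Random weights for the toy model (hypothetical shapes, untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W_feat": rng.normal(scale=0.1, size=(feat_dim, hidden_dim)),
        # +1 input column for the target speaker's activity value.
        "W_emb": rng.normal(scale=0.1, size=(hidden_dim + 1, hidden_dim)),
        "W_out": rng.normal(scale=0.1, size=(hidden_dim, num_senones)),
    }


def scam_forward(features: np.ndarray, activity: np.ndarray, params: dict) -> np.ndarray:
    """Toy speaker-conditioned forward pass.

    features: (T, D) acoustic features for a single-channel recording.
    activity: (T, S) speaker activity from a diarization system, in [0, 1].
    Returns:  (S, T, K) senone posteriors, one stream per speaker.
    """
    # Shared representation of the mixed speech, computed once.
    hidden = np.tanh(features @ params["W_feat"])            # (T, H)
    per_speaker = []
    for s in range(activity.shape[1]):
        act = activity[:, s:s + 1]                           # (T, 1)
        # Assumed conditioning: append the speaker's activity to the shared
        # representation and project it to a speaker-specific embedding.
        conditioned = np.concatenate([hidden, act], axis=1)  # (T, H + 1)
        embedding = np.tanh(conditioned @ params["W_emb"])   # (T, H)
        per_speaker.append(softmax(embedding @ params["W_out"]))  # (T, K)
    return np.stack(per_speaker)                             # (S, T, K)


# Toy usage: 50 frames, 40-dim features, 2 speakers, 100 senones.
rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 40))
act = (rng.random(size=(50, 2)) > 0.5).astype(float)  # overlap is allowed
posteriors = scam_forward(feats, act, init_params(40, 64, 100))
print(posteriors.shape)
```

In a hybrid ASR system, each speaker's posterior stream would then be passed to a decoder to produce that speaker's transcript. Under the joint learning approach mentioned in the abstract, the activity inputs would come from a diarization model trained together with the acoustic model, rather than from fixed random values as in this toy example.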
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
