Speaker conditioned acoustic modeling for multi-speaker conversational ASR

Srikanth Raj Chetupalli; Sriram Ganapathy

arXiv:2104.01882·eess.AS·August 30, 2022·Interspeech

TL;DR

This paper introduces a speaker conditioned acoustic model for multi-speaker conversational ASR that combines speaker diarization with joint optimization of the diarization and acoustic models, reducing word error rate on conversations with overlapping speech.

Contribution

It presents a novel speaker conditioned acoustic model that integrates speaker activity inputs and joint training, improving multi-speaker ASR performance over existing methods.

Findings

01 Achieved 12% relative WER reduction over baseline

02 Effective integration of speaker diarization with ASR

03 Joint optimization enhances transcription accuracy

Abstract

In this paper, we propose a novel approach for the transcription of speech conversations with natural speaker overlap, from single-channel speech recordings. The proposed model is a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system. The speaker conditioned acoustic model (SCAM) in the ASR system consists of a series of embedding layers which use the speaker activity inputs from the diarization system to derive speaker-specific embeddings. The outputs of the SCAM are speaker-specific senones that are used for decoding the transcripts for each speaker in the conversation. In this work, we experiment with the automatic speaker activity decisions generated using an end-to-end speaker diarization system. A joint learning approach is also proposed where the diarization model and the ASR acoustic model are jointly optimized. The experiments are…
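
To make the data flow described in the abstract concrete, here is a minimal NumPy sketch of a speaker conditioned forward pass. It is not the authors' architecture: the single shared embedding layer, the multiplicative gating by speaker activity, the layer sizes, and the names `scam_forward`, `w_embed`, and `w_out` are all assumptions chosen for illustration. It only shows how per-frame features and per-speaker activity inputs could yield one senone posterior stream per speaker.

```python
import numpy as np


def scam_forward(features, activity, w_embed, w_out):
    """Toy speaker conditioned acoustic model (illustrative assumption only).

    features: (T, D) acoustic features for T frames.
    activity: (T, S) speaker activity from a diarization system, values in [0, 1].
    w_embed:  (D, H) weights of a shared embedding layer (hypothetical).
    w_out:    (H, K) weights of the senone output layer (hypothetical).
    Returns log posteriors over K senones with shape (S, T, K).
    """
    hidden = np.tanh(features @ w_embed)  # (T, H) shared frame embedding
    # Assumed conditioning: gate the shared embedding by each speaker's activity.
    conditioned = activity.T[:, :, None] * hidden[None, :, :]  # (S, T, H)
    logits = conditioned @ w_out  # (S, T, K)
    # Numerically stable log-softmax over the senone axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))


rng = np.random.default_rng(0)
T, D, H, K, S = 50, 40, 16, 10, 2  # frames, feature dim, hidden dim, senones, speakers
features = rng.normal(size=(T, D))
activity = (rng.random(size=(T, S)) > 0.5).astype(float)  # overlap allowed per frame
w_embed = rng.normal(size=(D, H)) * 0.1
w_out = rng.normal(size=(H, K)) * 0.1

log_post = scam_forward(features, activity, w_embed, w_out)
print(log_post.shape)
```

Each slice `log_post[s]` plays the role of the speaker-specific senone stream that a hybrid decoder would consume for speaker `s`. In the joint learning setup the abstract mentions, `activity` would come from a trainable diarization model rather than being fixed input.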

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques