Simulating realistic speech overlaps improves multi-talker ASR

Muqiao Yang; Naoyuki Kanda; Xiaofei Wang; Jian Wu; Sunit Sivasankaran; Zhuo Chen; Jinyu Li; Takuya Yoshioka

arXiv:2210.15715 · eess.AS · November 21, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a novel method for simulating realistic multi-talker speech overlaps using a statistical language model, leading to improved automatic speech recognition performance in overlapping speech scenarios.

Contribution

The paper presents a new technique to generate realistic multi-talker speech overlaps by modeling overlap patterns with a language model, enhancing training data quality for ASR.

Findings

01. Improved word error rates across multiple datasets.
02. Realistic overlap simulation benefits multi-talker ASR.
03. Method outperforms naive mixing approaches.

Abstract

Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human transcriptions, a naïve simulation of multi-talker speech by randomly mixing multiple utterances was conventionally used for model training. In this work, we propose an improved technique to simulate multi-talker overlapping speech with realistic speech overlaps, where an arbitrary pattern of speech overlaps is represented by a sequence of discrete tokens. With this representation, speech overlapping patterns can be learned from real conversations based on a statistical language model, such as N-gram, which can then be used to generate multi-talker speech for training. In our experiments, multi-talker ASR models trained with the proposed…
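The idea in the abstract (overlap patterns as discrete token sequences, learned with an N-gram model and then sampled) can be illustrated with a minimal sketch. The excerpt does not define the paper's token scheme, so everything here is an assumption made for illustration: one token per fixed-length frame, encoded as a bitmask of active speakers; a plain bigram model without smoothing; and the hypothetical helper names `encode_overlap`, `train_bigram`, and `sample_pattern`. It is not the authors' implementation.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # sequence boundary markers (assumed, not from the paper)


def encode_overlap(segments, num_frames, frame_sec=0.5):
    """Turn (speaker_index, start_sec, end_sec) segments into one token per frame.

    Assumed token scheme: the token is the bitmask of speakers active in the
    frame, written as a string ("0" = silence, "3" = speakers 0 and 1 overlap).
    """
    tokens = []
    for f in range(num_frames):
        t0, t1 = f * frame_sec, (f + 1) * frame_sec
        mask = 0
        for spk, start, end in segments:
            if start < t1 and end > t0:  # segment intersects this frame
                mask |= 1 << spk
        tokens.append(str(mask))
    return tokens


def train_bigram(sequences):
    """Count token-to-token transitions over all training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_pattern(counts, rng, max_len=50):
    """Sample a new overlap pattern by walking the bigram transition counts."""
    out, prev = [], BOS
    while len(out) < max_len:
        successors = counts[prev]
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        if nxt == EOS:
            break
        out.append(nxt)
        prev = nxt
    return out


# Usage: two speakers, speaker 1 starts while speaker 0 is still talking.
pattern = encode_overlap([(0, 0.0, 1.0), (1, 0.5, 1.5)], num_frames=3)
model = train_bigram([pattern])
sampled = sample_pattern(model, random.Random(0))
```

A sampled token sequence would then decide where single-speaker utterances are placed and how much they overlap when they are mixed into a training example. That mixing step is outside this sketch.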

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems