Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings
Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

TL;DR
This paper introduces a hypothesis stitcher for end-to-end speaker-attributed ASR that fuses hypotheses from short audio segments, reducing the speaker-attributed word error rate (SA-WER) on long-form multi-talker recordings.
Contribution
It proposes a new sequence-to-sequence hypothesis stitcher that improves E2E SA-ASR performance on long recordings, addressing the mismatch between training and testing conditions.
Findings
Significant reduction in SA-WER on LibriSpeech and LibriCSS datasets.
The hypothesis stitcher outperforms conventional decoding methods.
Architectural variations of the stitcher show consistent improvements.
Abstract
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch between the training and testing conditions. It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training. In this work, we first apply a known decoding technique that was developed to perform single-speaker ASR for long-form audio to our E2E SA-ASR task. Then, we propose a novel method using a sequence-to-sequence model, called hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments…
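To make the "short audio segments" idea concrete, here is a minimal sketch of how a long recording might be cut into fixed-length windows before each window is decoded separately. It rests on assumptions not stated in the text above: the function name `sliding_windows`, the parameters `window_sec` and `hop_sec`, and the use of overlapping windows are all illustrative choices. The hypothesis stitcher itself is a learned sequence-to-sequence model and is not reproduced here.

```python
from typing import List, Tuple


def sliding_windows(
    total_sec: float, window_sec: float, hop_sec: float
) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, of windows covering [0, total_sec].

    Hypothetical helper: window_sec and hop_sec are illustrative parameters,
    not values taken from the paper.
    """
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    if hop_sec > window_sec:
        raise ValueError("hop_sec must not exceed window_sec, or audio is skipped")

    windows: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        # Clip the last window to the end of the recording.
        end = min(start + window_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break
        start += hop_sec
    return windows


if __name__ == "__main__":
    # Each window would be decoded on its own; the per-window hypotheses
    # are what a stitcher would then fuse into one output.
    for start, end in sliding_windows(total_sec=10.0, window_sec=4.0, hop_sec=2.0):
        print(f"decode segment {start:.1f}s to {end:.1f}s")
```

With a hop smaller than the window, neighbouring segments overlap, so the same words can appear in more than one per-segment hypothesis; reconciling those duplicates is the kind of problem a fusion step has to solve.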
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
