Speech Recognition and Multi-Speaker Diarization of Long Conversations

Huanru Henry Mao; Shuyang Li; Julian McAuley; Garrison Cottrell

arXiv:2005.08072 · eess.AS · November 6, 2020

TL;DR

This paper introduces a new benchmark for long multi-speaker conversations, compares joint and separate models for speech recognition and diarization, and proposes methods to improve performance on extended dialogues.

Contribution

It presents a new long-form podcast dataset, evaluates joint versus separate models, and introduces a striding attention decoding algorithm for better handling of long conversations.

Findings

1. Joint models can outperform separate models when utterance boundaries are unknown; separately trained models perform better when boundaries are known.
2. The proposed striding attention decoding algorithm improves ASR and SD performance on long conversations.
3. Data augmentation, combined with model pre-training, further improves ASR and SD on extended dialogues.

Abstract

Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This American Life radio program to better compare these approaches when applied to extended multi-speaker conversations. We find that separately trained ASR and SD models perform better when utterance boundaries are known, but that joint models can perform better otherwise. To handle long conversations with unknown utterance boundaries, we introduce a striding attention decoding algorithm and data augmentation techniques which, combined with model pre-training, improve ASR and SD.
