TL;DR
This paper introduces a new benchmark for long multi-speaker conversations, compares joint and separate models for speech recognition and diarization, and proposes methods to improve performance on extended dialogues.
Contribution
It presents a new long-form podcast dataset, evaluates joint versus separate models, and introduces a striding attention decoding algorithm for better handling of long conversations.
Findings
Joint models outperform separate models when utterance boundaries are unknown.
The proposed decoding algorithm improves ASR and SD performance on long conversations.
Data augmentation and pre-training enhance model accuracy on extended dialogues.
Abstract
Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This American Life radio program to better compare these approaches when applied to extended multi-speaker conversations. We find that training separate ASR and SD models performs better when utterance boundaries are known, but otherwise joint models can perform better. To handle long conversations with unknown utterance boundaries, we introduce a striding attention decoding algorithm and data augmentation techniques which, combined with model pre-training, improve ASR and SD.
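The abstract does not spell out how the striding attention decoding algorithm works, so the following is only a generic sketch of the underlying idea of covering a long recording with overlapping, strided windows. It is not the paper's algorithm. The function name `strided_windows` and the window and stride values are hypothetical, chosen for illustration.

```python
from typing import List, Tuple


def strided_windows(
    total_sec: float, window_sec: float, stride_sec: float
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping windows covering [0, total_sec].

    Generic illustration only: a model would decode each window separately,
    and the overlap between neighbours gives context across window edges.
    """
    if window_sec <= 0 or stride_sec <= 0:
        raise ValueError("window_sec and stride_sec must be positive")
    if stride_sec > window_sec:
        raise ValueError("a stride larger than the window would leave gaps")
    if total_sec <= 0:
        return []

    windows = []
    start = 0.0
    while True:
        # Clip the final window so it never runs past the end of the audio.
        end = min(start + window_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break
        start += stride_sec
    return windows


if __name__ == "__main__":
    # Hypothetical settings: a one-hour episode, 30 s windows, 20 s stride.
    for start, end in strided_windows(3600.0, 30.0, 20.0)[:3]:
        print(f"{start:7.1f} s to {end:7.1f} s")
```

With a stride shorter than the window, consecutive windows share audio, which is what lets a decoder see context on both sides of a boundary it cannot know in advance.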
Code & Models
- 🤗 pyannote/speaker-diarization (model) · 741k dl · ♡ 1249
- 🤗 nvidia/canary-1b-v2 (model) · 123k dl · ♡ 371
- 🤗 bhuvanesh25/pyannote-diar-copy (model) · 2 dl
- 🤗 paris-iea/speaker-diarization (model) · 9 dl · ♡ 1
- 🤗 hicustomer/pyannote-speaker-diarization (model) · 16 dl · ♡ 1
- 🤗 sagnik-p/speakerd (model) · 1 dl
