TOGGL: Transcribing Overlapping Speech with Staggered Labeling

Chak-Fai Li; William Hartmann; Matthew Snover

arXiv:2408.06474 · cs.CL · August 14, 2024

TL;DR

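TOGGL is a single-decoder model that transcribes overlapping speech from multiple speakers, using special output tokens to attribute the speech to each speaker. It outperforms competing approaches on a conversational speech dataset and generalizes beyond two speakers even when trained only on two-speaker data.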

Contribution

The paper presents TOGGL, a unified model that transcribes multiple overlapping speakers with a single decoder, eliminating the need for separate decoders or streams.

Findings

01 Outperforms competing approaches on a conversational speech dataset

02 Generalizes beyond two speakers, even when trained only on two-speaker data

03 Improves transcription accuracy on single-speaker audio

Abstract

Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.
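
To make the single-decoder idea concrete, the sketch below shows one way a single output sequence containing speaker tokens could be split into per-speaker transcripts. The token format (`<spk0>`, `<spk1>`, ...), the function name `split_by_speaker`, and the rule that untagged speech belongs to speaker 0 are all assumptions made for illustration; the abstract does not specify the actual token inventory or labeling scheme used by TOGGL.

```python
import re
from typing import Dict, List

# Hypothetical token format: "<spk0>", "<spk1>", ...
# This is an illustrative assumption, not the token set used in the paper.
SPEAKER_TOKEN = re.compile(r"^<spk(\d+)>$")


def split_by_speaker(tokens: List[str]) -> Dict[int, str]:
    """Split one decoder output sequence into per-speaker transcripts.

    A speaker token switches the active speaker; every following word is
    attributed to that speaker until the next speaker token appears.
    """
    transcripts: Dict[int, List[str]] = {}
    current = 0  # assumption: untagged speech belongs to speaker 0
    for tok in tokens:
        match = SPEAKER_TOKEN.match(tok)
        if match:
            current = int(match.group(1))
            transcripts.setdefault(current, [])
        else:
            transcripts.setdefault(current, []).append(tok)
    return {spk: " ".join(words) for spk, words in transcripts.items()}


# Two overlapping speakers interleaved in a single output stream.
stream = ["<spk0>", "how", "are", "<spk1>", "hello",
          "<spk0>", "you", "<spk1>", "there"]
result = split_by_speaker(stream)
print(result)
```

Because the speaker index is read from the token itself, nothing in this post-processing step is tied to exactly two speakers, which is the property that would let a single output stream cover more speakers than were seen in training.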

Taxonomy

Topics: Subtitles and Audiovisual Media · Speech Recognition and Synthesis · Phonetics and Phonology Research