TOGGL: Transcribing Overlapping Speech with Staggered Labeling
Chak-Fai Li, William Hartmann, Matthew Snover

TL;DR
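TOGGL is a single-decoder model that transcribes overlapping speech from multiple speakers, using special output tokens to attribute the speech to each speaker. It outperforms competing approaches on a conversational speech dataset and generalizes beyond two speakers, even when trained only on two-speaker data.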
Contribution
The paper presents TOGGL, a unified model that transcribes multiple overlapping speakers with a single decoder, eliminating the need for separate decoders or streams.
Findings
Outperforms competing approaches on a conversational speech dataset
Generalizes beyond two speakers even when trained on two-speaker data
Improves transcription accuracy on single-speaker audio
Abstract
Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.
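To make the single-decoder idea concrete, the sketch below shows how one output token sequence containing speaker-attribution tokens could be split into per-speaker transcripts after decoding. It is only an illustration under stated assumptions: the tag names `<spk1>` and `<spk2>`, the function `split_by_speaker`, and the rule that a tag applies to every word until the next tag are all hypothetical and are not taken from the paper, which does not specify its token format in the abstract.

```python
from collections import defaultdict


def split_by_speaker(tokens, speaker_tags=("<spk1>", "<spk2>")):
    """Group the words of one decoded sequence by speaker tag.

    Assumption (hypothetical, not from the paper): each speaker tag
    assigns all following words to that speaker until another tag appears.
    Words seen before any tag are assigned to the first tag.
    """
    transcripts = defaultdict(list)
    current = speaker_tags[0]
    for tok in tokens:
        if tok in speaker_tags:
            # A special token switches the active speaker; it is not a word.
            current = tok
        else:
            transcripts[current].append(tok)
    return {tag: " ".join(words) for tag, words in transcripts.items()}


# Hypothetical decoder output for two overlapping speakers.
decoded = ["<spk1>", "hello", "how", "<spk2>", "fine",
           "<spk1>", "are", "you", "<spk2>", "thanks"]
print(split_by_speaker(decoded))
```

Because the speaker set is just a tuple of tags in this sketch, passing a longer tuple handles more than two speakers in the post-processing step; whether and how the model itself emits tokens for additional speakers is described in the paper, not here.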
Taxonomy
Topics: Subtitles and Audiovisual Media · Speech Recognition and Synthesis · Phonetics and Phonology Research
