Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems
Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg

TL;DR
Sortformer introduces a permutation-resolving loss and architecture for improved speaker diarization and multi-speaker speech-to-text, enhancing accuracy and integration into larger models.
Contribution
It proposes a novel Sort Loss and a streamlined multi-speaker speech-to-text architecture that effectively address speaker permutation issues in speech-to-text systems.
Findings
Sort Loss improves speaker diarization performance.
Incorporating speaker supervision enhances multi-speaker transcription accuracy.
The approach facilitates seamless integration of speaker tagging into speech-to-text models.
Abstract
Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the…
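The contrast between Sort Loss and PIL described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that "sorted objectives" means ordering the target speakers by arrival time (the first frame in which each speaker is active) and that the per-frame objective is binary cross-entropy. The helper names `bce`, `arrival_order`, `sort_loss` and `pil_loss` are hypothetical and chosen only for this example.

```python
import math
from itertools import permutations


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a frames x speakers grid."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def permute_columns(matrix, order):
    """Reorder the speaker columns of a frames x speakers grid."""
    return [[row[k] for k in order] for row in matrix]


def arrival_order(target):
    """Speaker indices sorted by first active frame (assumed sorting rule).

    Speakers that never speak are placed last.
    """
    num_frames = len(target)

    def first_active(k):
        for t in range(num_frames):
            if target[t][k] == 1:
                return t
        return num_frames

    return sorted(range(len(target[0])), key=first_active)


def sort_loss(pred, target):
    """Loss against one fixed target ordering: speakers sorted by arrival."""
    return bce(pred, permute_columns(target, arrival_order(target)))


def pil_loss(pred, target):
    """Permutation invariant loss: minimum over all speaker orderings."""
    num_speakers = len(target[0])
    return min(
        bce(pred, permute_columns(target, list(perm)))
        for perm in permutations(range(num_speakers))
    )


# Toy example: 4 frames, 2 speakers. Column 1 speaks first, column 0 later.
target = [[0, 1], [0, 1], [1, 0], [1, 1]]
# Model output that already follows arrival order (first speaker in column 0).
pred = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.8]]

print("arrival order:", arrival_order(target))
print("sort loss:", sort_loss(pred, target))
print("PIL:", pil_loss(pred, target))
```

In this sketch the sorted objective evaluates a single target ordering, whereas PIL searches over every permutation of the speakers, so the cost of PIL grows factorially with the number of speakers. The two losses coincide whenever the model's output already follows the sorted order, as in the toy example.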
Code & Models
- 🤗 nvidia/diar_sortformer_4spk-v1 · model · 5.3k dl · ♡ 137
- 🤗 nvidia/diar_streaming_sortformer_4spk-v2 · model · 23k dl · ♡ 111
- 🤗 nvidia/diar_streaming_sortformer_4spk-v2.1 · model · 6.5k dl · ♡ 59
- 🤗 nvidia/ssl_en_nest_large_v1.0 · model · 57 dl · ♡ 8
- 🤗 nvidia/ssl_en_nest_xlarge_v1.0 · model · 156 dl · ♡ 7
- 🤗 nvidia/multitalker-parakeet-streaming-0.6b-v1 · model · 525 dl · ♡ 94
- 🤗 aufklarer/Sortformer-Diarization-CoreML · model · 366 dl
- 🤗 everyscribe/diar_streaming_sortformer_4spk-v2 · model · 3 dl
- 🤗 thoratsr7/multitalker-parakeet-streaming-0.6b-v1 · model
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Adapter
