Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems

Taejin Park; Ivan Medennikov; Kunal Dhawan; Weiqing Wang; He Huang; Nithin Rao Koluguri; Krishna C. Puvvada; Jagadeesh Balam; Boris Ginsburg

arXiv:2409.06656 · eess.AS · July 22, 2025

TL;DR

Sortformer is an encoder-based speaker diarization model trained with Sort Loss, an objective that resolves speaker permutation. It improves speaker diarization and multi-speaker speech-to-text accuracy, and it is designed to supervise speaker tagging inside speech-to-text models.

Contribution

The paper proposes Sort Loss, usable on its own or together with permutation invariant loss (PIL), and a streamlined multi-speaker speech-to-text architecture that uses Sortformer for speaker supervision to address the speaker permutation problem.

Findings

01. Sort Loss improves speaker diarization performance.

02. Incorporating speaker supervision enhances multi-speaker transcription accuracy.

03. The approach facilitates seamless integration of speaker tagging into speech-to-text models.

Abstract

Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the…
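
To make the contrast between Sort Loss and PIL concrete, here is a minimal NumPy sketch. It assumes the sorted objective orders target speakers by arrival time (the first frame in which each speaker is active) and scores predictions with binary cross-entropy. The abstract only says the objectives are sorted, so the ordering rule, the function names (`sort_by_arrival`, `sort_loss`, `pil_loss`), and the frames-by-speakers array layout are illustrative assumptions, not the NeMo implementation.

```python
import itertools

import numpy as np


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all frames and speakers."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))


def sort_by_arrival(target):
    """Reorder speaker columns by the first frame in which each speaker is active.

    Assumption: 'sorted' means arrival-time order. Silent speakers go last.
    """
    n_frames = target.shape[0]
    active = target > 0.5
    # argmax returns 0 for an all-False column, so silent speakers get n_frames.
    first_frame = np.where(active.any(axis=0), active.argmax(axis=0), n_frames)
    order = np.argsort(first_frame, kind="stable")
    return target[:, order]


def sort_loss(pred, target):
    """Score predictions against a single, arrival-sorted target arrangement."""
    return bce(pred, sort_by_arrival(target))


def pil_loss(pred, target):
    """Permutation invariant loss: best score over every speaker permutation."""
    n_speakers = target.shape[1]
    return min(
        bce(pred, target[:, list(perm)])
        for perm in itertools.permutations(range(n_speakers))
    )


if __name__ == "__main__":
    # Four frames, two speakers; the speaker in column 1 talks first.
    target = np.array([[0, 1], [0, 1], [1, 0], [1, 0]], dtype=float)
    pred = sort_by_arrival(target)  # a model that emits speakers in arrival order
    print("sort loss:", sort_loss(pred, target))
    print("PIL:      ", pil_loss(pred, target))
```

Under these assumptions, Sort Loss scores exactly one target arrangement, so the model has to emit speakers in a fixed order. PIL searches every permutation, which is why its value never exceeds the Sort Loss value for the same inputs and why its cost grows factorially with the number of speakers.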

Code & Models

Repositories

NVIDIA/NeMo (PyTorch, official)

Videos

Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Adapter