DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Trained Speech Foundational Model

Massa Baali; Rita Singh; Bhiksha Raj

arXiv:2510.17662 · cs.SD · March 26, 2026


TL;DR

DELULU is a novel self-supervised speech model that incorporates speaker information into training, significantly improving speaker verification and profiling tasks without requiring task-specific fine-tuning.

Contribution

DELULU introduces speaker-aware pseudo-labeling: frame-level embeddings from a speaker verification model (ReDimNet) guide the k-means clustering that produces pre-training targets, which strengthens speaker-discriminative features in the self-supervised speech representations.

Findings

1. Up to 62% relative improvement in speaker verification equal error rate (EER)
2. Consistent gains in zero-shot profiling tasks
3. Surpasses teacher model on zero-shot evaluations
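
To make the "relative improvement in EER" figure concrete, here is a minimal sketch of how such a number is conventionally computed. The function name and the two EER values are hypothetical and chosen only for illustration; they are not results reported for DELULU.

```python
def relative_improvement(baseline_eer: float, new_eer: float) -> float:
    """Fraction by which the EER drops relative to the baseline (lower EER is better)."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical values for illustration only, NOT numbers from the paper.
baseline_eer = 5.0  # percent
new_eer = 1.9       # percent
improvement = relative_improvement(baseline_eer, new_eer)
print(f"{improvement:.0%} relative improvement")
```

A relative improvement is a ratio of error rates, so it should not be read as a drop of that many absolute percentage points.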

Abstract

Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-trained foundational model that addresses this limitation by incorporating speaker-informed structure into pseudo-label generation. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide k-means clustering during pre-training, introducing a speaker-discriminative inductive bias that aligns representation learning with speaker identity. DELULU significantly outperforms prior SSL models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot…
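
The pseudo-labeling step described in the abstract (k-means over frame-level speaker embeddings) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the random array stands in for ReDimNet frame-level embeddings, and the function name, embedding dimension, cluster count, and iteration count are all hypothetical.

```python
import numpy as np


def kmeans_pseudo_labels(frames: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """Plain Lloyd's k-means; returns one cluster id per frame plus the centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames (fancy indexing makes a copy).
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = frames[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), centroids


# Placeholder data: 200 frames of 16-dim "speaker embeddings".
# In the setting the abstract describes, these would come from ReDimNet.
rng = np.random.default_rng(42)
frame_embeddings = rng.normal(size=(200, 16))

labels, centroids = kmeans_pseudo_labels(frame_embeddings, k=8)
print(labels[:10])  # per-frame cluster ids, usable as discrete pre-training targets
```

The point of clustering speaker-model embeddings rather than acoustic features is that the resulting discrete targets group frames by speaker-related structure, which is the inductive bias the abstract refers to.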


Code & Models

Models

cmu-mlsp/DELULU (Hugging Face model)


Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Authorship Attribution and Profiling