kNN For Whisper And Its Effect On Bias And Speaker Adaptation

Maya K. Nachesa; Vlad Niculae

arXiv:2410.18850·cs.CL·February 12, 2025

TL;DR

This paper explores how token-level k-nearest neighbor search ($k$NN) enhances Whisper speech recognition, particularly in addressing bias and speaker adaptation issues without retraining the model.

Contribution

It applies token-level $k$NN to Whisper and analyzes its effects on bias and speaker adaptation, broken down by gender, accent, and age.

Findings

1. $k$NN improves Whisper's recognition accuracy across different speaker groups.
2. Using $k$NN reduces bias related to gender, accent, and age.
3. The method offers a non-parametric alternative for speaker adaptation.

Abstract

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
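
To make the inference-time search concrete, here is a minimal sketch of token-level $k$NN decoding in the general form used for NLG and MT: decoder hidden states serve as datastore keys, the tokens that followed them serve as values, and the retrieved distribution is interpolated with the model's own. The function names, the squared L2 distance, and the values of `k`, `temperature`, and `lam` are illustrative assumptions, not the settings or code used in the paper, and the toy vectors stand in for real Whisper decoder states.

```python
import numpy as np


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens.

    All parameter names and defaults are hypothetical, for illustration only.
    """
    # Squared L2 distance from the query vector to every stored key.
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances: closer neighbours get more weight.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Accumulate the weights of neighbours that share the same target token.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    return p_knn


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the model's next-token distribution with the retrieved one."""
    return lam * p_knn + (1.0 - lam) * p_model


# Toy datastore: 2-d "hidden states" (keys) and the token ids that followed (values).
vocab_size = 5
keys = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
values = np.array([2, 2, 4, 4])

# A query close to the first two keys, with an uninformative model distribution.
query = np.array([0.05, 0.0])
p_model = np.full(vocab_size, 1.0 / vocab_size)

p_knn = knn_distribution(query, keys, values, vocab_size, k=2)
p_final = interpolate(p_model, p_knn, lam=0.5)
print(p_final)
```

In this toy case both retrieved neighbours vote for token 2, so the interpolated distribution favours it even though the model alone is undecided. The model weights are never updated; adapting to a new speaker or domain only requires adding entries to `keys` and `values`.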

