Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning

Robin Algayres; Adel Nabli; Benoit Sagot; Emmanuel Dupoux

arXiv:2204.05148 · cs.AI · October 24, 2023 · 1 citation

TL;DR

This paper presents a neural encoder trained with contrastive learning using data-augmented k-NN positives, achieving state-of-the-art speech sequence embedding results across multiple languages and tasks.

Contribution

Introduces an unsupervised contrastive learning method that draws its positive samples from k-NN search, improving speech sequence embeddings on query-by-example retrieval and spoken term discovery.

Findings

01

Achieves state-of-the-art results on query-by-example retrieval and spoken term discovery

02

Effective across five languages and two major tasks

03

Establishes a query-by-example benchmark on the LibriSpeech dataset

Abstract

We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations, this method can be applied iteratively and yield competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state of the art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
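To make the core idea concrete, here is a minimal sketch of contrastive training signals built from k-NN positives. It is an illustration under stated assumptions, not the authors' implementation: the function names (`knn_positives`, `knn_contrastive_loss`), the cosine similarity measure, the InfoNCE-style loss, and the temperature value are all choices made for this example, and the encoder, the data augmentation step, and the iterative retraining described in the abstract are left out. The toy embeddings stand in for fixed-size vectors that an encoder would produce from speech sequences.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def knn_positives(embeddings, k=1):
    """Return, for each row, the indices of its k nearest neighbors.

    Neighbors are ranked by cosine similarity and an item is never
    returned as its own neighbor. (Hypothetical helper for this sketch.)
    """
    z = l2_normalize(embeddings)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]  # most similar first

def knn_contrastive_loss(embeddings, pos_idx, temperature=0.1):
    """InfoNCE-style loss (an assumption, not taken from the paper).

    For anchor i, the positive is embeddings[pos_idx[i]]; every other
    item in the batch except i itself acts as a negative.
    """
    z = l2_normalize(embeddings)
    logits = (z @ z.T) / temperature
    np.fill_diagonal(logits, -np.inf)       # an anchor is not its own candidate
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_prob[np.arange(len(z)), pos_idx]))

# Toy data: 4 clusters of 2 points each, standing in for sequence embeddings.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 16))
emb = np.repeat(centers, 2, axis=0) + 0.01 * rng.normal(size=(8, 16))

nn_idx = knn_positives(emb, k=1)[:, 0]       # one mined positive per anchor
loss_knn = knn_contrastive_loss(emb, nn_idx)

# For comparison: positives deliberately taken from a different cluster.
wrong_idx = (np.arange(len(emb)) + 2) % len(emb)
loss_wrong = knn_contrastive_loss(emb, wrong_idx)

print("mined positives:", nn_idx.tolist())
print("loss with k-NN positives:", loss_knn)
print("loss with mismatched positives:", loss_wrong)
```

In this toy setting each point's nearest neighbor is its cluster mate, so the loss with mined positives is lower than the loss with mismatched ones. In an actual training loop, the loss would be minimized with respect to the encoder parameters, and the neighbor search could be rerun on the updated embeddings, which is one way to read the "applied iteratively" remark in the abstract.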

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Contrastive Learning · Stochastic Steady-state Embedding