Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
Robin Algayres, Adel Nabli, Benoit Sagot, Emmanuel Dupoux

TL;DR
This paper presents a neural encoder trained with contrastive learning using data-augmented k-NN positives, achieving state-of-the-art speech sequence embedding results across multiple languages and tasks.
Contribution
It introduces an unsupervised contrastive learning method that draws its positive samples from a data-augmented k-NN search, improving speech sequence embeddings on query-by-example retrieval and spoken term discovery.
Findings
Advances the state of the art on both query-by-example retrieval and spoken term discovery
Holds across five languages on both tasks
Establishes a query-by-example benchmark on the LibriSpeech dataset
Abstract
We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations, this method can be applied iteratively and yield competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state of the art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.
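The core idea in the abstract, a contrastive objective whose positives come from a k-NN search, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`knn_positives`, `info_nce_loss`), the InfoNCE-style loss with in-batch negatives, the temperature, and the synthetic data are all assumptions made for illustration, and the data augmentation and the trainable encoder are left out entirely.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def knn_positives(embeddings, k):
    # Hypothetical helper: for each row, return the indices of its k nearest
    # neighbors under cosine similarity, excluding the row itself.
    z = l2_normalize(embeddings)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)  # a sequence is never its own positive
    return np.argsort(-sim, axis=1)[:, :k]

def info_nce_loss(anchors, positives, temperature=0.1):
    # Assumed loss: InfoNCE with in-batch negatives. Row i of `positives` is
    # the positive for row i of `anchors`; every other row acts as a negative.
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = (a @ p.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Synthetic stand-in for pooled speech features: 4 clusters of 5 vectors each.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 16))
labels = np.repeat(np.arange(4), 5)
embeddings = centers[labels] + 0.05 * rng.normal(size=(20, 16))

neighbors = knn_positives(embeddings, k=2)
positives = embeddings[neighbors[:, 0]]  # nearest neighbor used as the positive
loss = info_nce_loss(embeddings, positives)
```

In an iterative scheme like the one the abstract describes, the embeddings produced by a trained encoder would presumably be fed back into the neighbor search to mine better positives for the next round; the sketch shows a single mining and loss step only.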
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Contrastive Learning · Stochastic Steady-state Embedding
