Generative Spoken Language Model based on continuous word-sized audio tokens

Robin Algayres; Yossi Adi; Tu Anh Nguyen; Jade Copet; Gabriel Synnaeve; Benoit Sagot; Emmanuel Dupoux

arXiv:2310.05224·cs.CL·October 10, 2023

TL;DR

This paper introduces the first generative spoken language model using continuous word-sized audio embeddings, achieving comparable quality to discrete models while being more memory-efficient and interpretable.

Contribution

It presents a novel GSLM based on continuous audio embeddings, replacing traditional discrete units, and demonstrates its effectiveness and interpretability.

Findings

01. Performance comparable to discrete unit GSLMs in quality

02. Five times more memory efficient due to larger units

03. Embeddings are phonetically and semantically interpretable

Abstract

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms- or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is…
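
The abstract names two of the substitutions only briefly, so a toy illustration may help: a contrastive loss in place of cross entropy, and k-NN sampling in place of multinomial sampling. The sketch below is not the authors' implementation. The function names (`cosine`, `contrastive_loss`, `knn_sample`), the use of cosine similarity, the InfoNCE-style form of the loss, the softmax weighting over neighbours, and the tiny two-dimensional `bank` are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(predicted, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): small when `predicted` is closer
    to `positive` than to every vector in `negatives`."""
    logits = [cosine(predicted, positive) / temperature]
    logits += [cosine(predicted, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]


def knn_sample(predicted, bank, k=3, temperature=0.1, rng=random):
    """Return the index of one of the k bank embeddings nearest to
    `predicted`, drawn with softmax weights over similarity (assumed)."""
    ranked = sorted(
        range(len(bank)),
        key=lambda i: cosine(predicted, bank[i]),
        reverse=True,
    )
    nearest = ranked[:k]
    weights = [math.exp(cosine(predicted, bank[i]) / temperature) for i in nearest]
    return rng.choices(nearest, weights=weights, k=1)[0]


# Hypothetical bank of word-sized embeddings (2-D only for readability).
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
predicted = [1.0, 0.05]

# The loss is low when the true next embedding is the nearby one...
loss_good = contrastive_loss(predicted, bank[0], bank[2:])
# ...and high when the true next embedding points the opposite way.
loss_bad = contrastive_loss(predicted, bank[3], bank[:1])

# Sampling is restricted to the two nearest candidates (indices 0 and 1).
choice = knn_sample(predicted, bank, k=2, rng=random.Random(0))
```

In this toy setting the model's output is a continuous vector rather than a distribution over a fixed vocabulary, so diversity at generation time has to come from choosing among nearby candidates, which is the role `knn_sample` plays here.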

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: k-Nearest Neighbors