Generative Spoken Language Model based on continuous word-sized audio tokens
Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

TL;DR
This paper introduces the first generative spoken language model using continuous word-sized audio embeddings, achieving comparable quality to discrete models while being more memory-efficient and interpretable.
Contribution
It presents a novel GSLM based on continuous audio embeddings, replacing traditional discrete units, and demonstrates its effectiveness and interpretability.
Findings
Performance comparable to discrete unit GSLMs in quality
Five times more memory efficient due to larger units
Embeddings are phonetically and semantically interpretable
Abstract
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is…
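The abstract says that multinomial sampling over a vocabulary is replaced by k-NN sampling over continuous embeddings. As a rough illustration of that idea only, the sketch below takes a predicted embedding, finds its k nearest neighbours in a bank of stored embeddings, and samples one of them. The function name `knn_sample`, the Euclidean distance, the softmax weighting over distances, and the `temperature` parameter are all assumptions made for this example; the summary does not specify how the paper weights or selects among neighbours.

```python
import math
import random


def knn_sample(predicted, bank, k=3, temperature=1.0, rng=None):
    """Sample the index of one of the k bank vectors closest to `predicted`.

    Hypothetical helper: the distance and weighting scheme are assumptions,
    not the paper's documented procedure.
    """
    rng = rng or random.Random()
    # Euclidean distance from the predicted embedding to every bank entry.
    dists = [math.dist(predicted, vec) for vec in bank]
    # Indices of the k nearest entries, closest first.
    nearest = sorted(range(len(bank)), key=dists.__getitem__)[:k]
    # Closer neighbours get larger weights (softmax over negative distance,
    # shifted by the smallest distance for numerical stability).
    d_min = dists[nearest[0]]
    weights = [math.exp(-(dists[i] - d_min) / temperature) for i in nearest]
    return rng.choices(nearest, weights=weights, k=1)[0]


# Toy bank of 2-D "word embeddings"; real embeddings would be much larger.
bank = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
choice = knn_sample([0.1, 0.1], bank, k=3, rng=random.Random(0))
```

With `k=1` the function always returns the single nearest entry, which corresponds to greedy decoding. Larger `k` or a higher `temperature` spreads the choice over more neighbours, which is one plausible way to obtain the diversity that multinomial sampling gives discrete models.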
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: k-Nearest Neighbors
