Are discrete units necessary for Spoken Language Modeling?
Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux

TL;DR
This paper investigates whether discrete units are necessary for spoken language modeling. It finds that discretization improves performance by filtering linguistically irrelevant information out of the continuous features, and it reports state-of-the-art results on the Zero Resource Speech Challenge 2021.
Contribution
The study compares discrete and continuous representations for spoken language modeling and shows that discretization improves language model performance by removing linguistically irrelevant information from the continuous features.
Findings
Discretization is essential for good spoken language modeling results.
Discrete units improve language modeling by filtering irrelevant information.
The approach achieves state-of-the-art results on the Zero Resource Speech Challenge 2021.
Abstract
Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performances. On the basis of this study, we train a…
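The discretization step described in the abstract can be pictured with a minimal sketch: each continuous feature frame is replaced by the index of its nearest entry in a codebook, giving a sequence of integer units (the "pseudo-text"). This is only an illustration under stated assumptions, not the authors' pipeline: the function names `nearest_unit` and `discretize`, the toy 2-dimensional frames, and the hand-written codebook are all hypothetical, and in practice the codebook would be learned from real speech features.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_unit(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `frame` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))


def discretize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous frame to a discrete unit index (one unit per frame)."""
    return [nearest_unit(frame, codebook) for frame in frames]


# Hypothetical toy data: a 3-entry codebook and 4 feature frames.
# Real systems use high-dimensional features and a learned codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.05, 0.02), (0.9, 1.1), (0.1, 0.8)]

units = discretize(frames, codebook)
print(units)  # [0, 0, 1, 2]
```

The first two frames differ slightly but map to the same unit, which is the sense in which discretization discards small variations in the continuous features. A language model would then be trained on the resulting unit sequence rather than on the frames themselves.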
