Generative Spoken Language Modeling from Raw Audio

Kushal Lakhotia; Evgeny Kharitonov; Wei-Ning Hsu; Yossi Adi; Adam Polyak; Benjamin Bolte; Tu-Anh Nguyen; Jade Copet; Alexei Baevski; Abdelrahman Mohamed; Emmanuel Dupoux

arXiv:2102.01192·cs.CL·September 13, 2021

TL;DR

This paper introduces Generative Spoken Language Modeling, the task of learning a language's acoustic and linguistic characteristics directly from raw audio without text or labels, and proposes metrics that evaluate the learned representations at the acoustic and linguistic levels.

Contribution

It presents the first framework for learning and evaluating spoken language models directly from raw audio without text or labels, including baseline systems and evaluation metrics.

Findings

01. The number of discrete units affects performance in a way that depends on the encoder and the task

02. Some model combinations approach the performance of text-based systems

03. Human evaluation validates the proposed metrics

Abstract

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
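
The encoder and language-model stages described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the nearest-centroid quantizer, the collapsing of repeated units, the bigram model standing in for the generative language model, and all function names are hypothetical. The real encoders (CPC, wav2vec 2.0, HuBERT) and the speech decoder are omitted.

```python
import random
from collections import Counter, defaultdict


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (a discrete unit)."""
    units = []
    for frame in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
        units.append(dists.index(min(dists)))
    return units


def collapse_repeats(units):
    """Merge consecutive duplicate units (an assumed preprocessing step)."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def train_bigram_lm(sequences):
    """Count unit-to-unit transitions; a toy stand-in for the generative language model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sample(counts, start, length, rng):
    """Sample a pseudo-text sequence of up to `length` units from the bigram counts."""
    seq = [start]
    for _ in range(length - 1):
        options = counts.get(seq[-1])
        if not options:
            break  # no observed continuation for this unit
        units, weights = zip(*options.items())
        seq.append(rng.choices(units, weights=weights)[0])
    return seq


# Toy 2-D "encoder features" and 3 hypothetical centroids
# (the paper compares 50, 100, and 200 units).
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 0.1), (1.0, 0.0), (0.1, 0.9), (0.0, 0.0)]

pseudo_text = collapse_repeats(quantize(frames, centroids))
lm = train_bigram_lm([pseudo_text])
generated = sample(lm, start=pseudo_text[0], length=4, rng=random.Random(0))
print(pseudo_text, generated)
```

In a full system, the generated unit sequence would be passed to a speech decoder to synthesize a waveform; that step is left out because it cannot be meaningfully reduced to a few lines.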

Code & Models

Models

🤗 jzx-ai-lab/flow_mirror
model · 4 downloads · ♡ 2
