Audio Language Modeling using Perceptually-Guided Discrete Representations

Felix Kreuk; Yaniv Taigman; Adam Polyak; Jade Copet; Gabriel Synnaeve; Alexandre Défossez; Yossi Adi

arXiv:2211.01223·cs.SD·November 7, 2022·1 cites

TL;DR

This paper trains a transformer-based causal language model on discrete audio representations produced by a perceptually-guided audio compression model, enabling audio generation and auto-completion. On Audioset, the quality of the generated samples is superior to that obtained with the evaluated baseline audio encoders.

Contribution

The work presents a method that encodes audio into perceptually-guided discrete representations and trains a transformer-based language model on them, improving generation quality.

Findings

01  Superior audio sample quality compared to baseline encoders

02  Effective audio auto-completion using discrete representations

03  Analysis of trade-offs between audio quality and modeling capabilities

Abstract

In this work, we study the task of Audio Language Modeling, in which we aim at learning probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model to encode audio to discrete representations. Next, we train a transformer-based causal language model using these representations. At inference time, we perform audio auto-completion by encoding an audio prompt as a discrete sequence, feeding it to the audio language model, sampling from the model, and synthesizing the corresponding time-domain signal. We evaluate the quality of samples generated by our method on Audioset, the largest dataset for general audio to date, and show that it is superior to the evaluated baseline audio encoders. We additionally provide an extensive analysis to better understand the trade-off between audio-quality and…
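The inference procedure described in the abstract (encode a prompt to discrete tokens, sample a continuation from a causal model, decode back to a waveform) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's method: a uniform scalar quantizer stands in for the perceptually-guided compression model, a bigram count model stands in for the transformer, and the names `NUM_TOKENS`, `encode`, `decode`, `train_bigram_lm`, `sample_next`, and `autocomplete` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

NUM_TOKENS = 16  # hypothetical codebook size, chosen only for this toy example


def encode(samples):
    """Toy stand-in for the codec encoder: uniform quantization of [-1, 1]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the supported range
        tokens.append(min(NUM_TOKENS - 1, int((x + 1.0) / 2.0 * NUM_TOKENS)))
    return tokens


def decode(tokens):
    """Toy stand-in for the codec decoder: map each token to its bin centre."""
    return [(t + 0.5) / NUM_TOKENS * 2.0 - 1.0 for t in tokens]


def train_bigram_lm(token_sequences):
    """Count next-token frequencies; a crude substitute for a causal transformer."""
    counts = defaultdict(Counter)
    for seq in token_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_next(lm, prev, rng):
    """Sample the next token in proportion to how often it followed `prev`."""
    options = lm.get(prev)
    if not options:  # unseen context: fall back to a uniform draw
        return rng.randrange(NUM_TOKENS)
    tokens, weights = zip(*options.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


def autocomplete(lm, prompt_samples, num_new, seed=0):
    """Encode a non-empty prompt, sample `num_new` tokens, decode everything."""
    rng = random.Random(seed)
    tokens = encode(prompt_samples)
    for _ in range(num_new):
        tokens.append(sample_next(lm, tokens[-1], rng))
    return decode(tokens)


# Usage: "train" on a sine wave, then continue a 16-sample prompt by 48 samples.
train_audio = [math.sin(2 * math.pi * n / 32) for n in range(512)]
lm = train_bigram_lm([encode(train_audio)])
completed = autocomplete(lm, train_audio[:16], num_new=48)
```

The sketch only mirrors the shape of the pipeline. In the setting the abstract describes, the encoder and decoder would be a learned compression model and the sampler a transformer conditioned on the full token history rather than on the previous token alone.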

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis