Audio Language Modeling using Perceptually-Guided Discrete Representations
Felix Kreuk, Yaniv Taigman, Adam Polyak, Jade Copet, Gabriel Synnaeve, Alexandre Défossez, Yossi Adi

TL;DR
This paper approaches audio language modeling by training a transformer-based causal language model on perceptually-guided discrete audio representations. The resulting model supports audio generation and completion, and its samples, evaluated on Audioset, are reported to be of higher quality than those obtained with the baseline audio encoders.
Contribution
The work presents a new method that leverages perceptually-guided discrete audio representations for training transformer-based language models for audio, improving generation quality.
Findings
Superior audio sample quality compared to baseline encoders
Effective audio auto-completion using discrete representations
Analysis of trade-offs between audio quality and modeling capabilities
Abstract
In this work, we study the task of Audio Language Modeling, in which we aim to learn probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model to encode audio into discrete representations. Next, we train a transformer-based causal language model using these representations. At inference time, we perform audio auto-completion by encoding an audio prompt as a discrete sequence, feeding it to the audio language model, sampling from the model, and synthesizing the corresponding time-domain signal. We evaluate the quality of samples generated by our method on Audioset, the largest dataset for general audio to date, and show that it is superior to the evaluated baseline audio encoders. We additionally provide an extensive analysis to better understand the trade-off between audio quality and…
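The auto-completion procedure described in the abstract (encode a prompt into discrete tokens, sample a continuation from a causal model, decode back to a waveform) can be illustrated with a toy sketch. This is not the authors' implementation. As loudly stated assumptions, a uniform scalar quantizer stands in for the perceptually-guided compression model, and a bigram count model stands in for the transformer language model. All names here (`NUM_TOKENS`, `encode`, `decode`, `fit_bigram`, `complete`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

NUM_TOKENS = 16  # hypothetical codebook size, chosen only for illustration


def encode(samples):
    """Toy stand-in for the codec encoder: uniform quantization of [-1, 1]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))                # clip to the valid range
        idx = int((x + 1.0) / 2.0 * NUM_TOKENS)   # bin index
        tokens.append(min(idx, NUM_TOKENS - 1))   # x == 1.0 falls in the last bin
    return tokens


def decode(tokens):
    """Toy stand-in for the codec decoder: map each token to its bin center."""
    return [(t + 0.5) / NUM_TOKENS * 2.0 - 1.0 for t in tokens]


def fit_bigram(tokens):
    """Toy stand-in for the causal language model: next-token counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def sample_next(counts, prev, rng):
    """Sample the next token given the previous one."""
    options = counts.get(prev)
    if not options:                               # unseen context: uniform fallback
        return rng.randrange(NUM_TOKENS)
    toks = list(options.keys())
    weights = list(options.values())
    return rng.choices(toks, weights=weights, k=1)[0]


def complete(prompt_samples, counts, n_new, seed=0):
    """Encode a non-empty prompt, sample n_new tokens, decode everything."""
    rng = random.Random(seed)
    tokens = encode(prompt_samples)
    for _ in range(n_new):
        tokens.append(sample_next(counts, tokens[-1], rng))
    return decode(tokens)


# "Train" on a sine wave, then continue a 16-sample prompt by 48 samples.
train = [math.sin(2 * math.pi * i / 32) for i in range(1024)]
model = fit_bigram(encode(train))
prompt = train[:16]
out = complete(prompt, model, n_new=48)
```

The first 16 values of `out` are the quantized prompt and the remaining 48 are the sampled continuation. A real system would replace the quantizer with a learned neural codec and the bigram table with a transformer that conditions on the full token history.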
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
