Masked Audio Generation using a Single Non-Autoregressive Transformer

Alon Ziv; Itai Gat; Gael Le Lan; Tal Remez; Felix Kreuk; Alexandre Défossez; Jade Copet; Gabriel Synnaeve; Yossi Adi

arXiv:2401.04577·cs.SD·March 6, 2024·1 cites

TL;DR

MAGNeT is a single-stage non-autoregressive transformer for audio generation. It predicts spans of masked audio tokens, rescores its predictions with an external pre-trained model, and, in a hybrid variant, generates the first few seconds autoregressively and the rest in parallel, giving faster text-to-audio synthesis at comparable quality.

Contribution

Introduces MAGNeT, a single-stage non-autoregressive transformer for audio generation, together with a novel rescoring method and a hybrid decoding approach. The model is faster than prior autoregressive models while reaching comparable quality.

Findings

01

MAGNeT is 7 times faster than autoregressive baselines.

02

The model achieves comparable quality to existing methods.

03

Hybrid decoding improves initial output quality while maintaining efficiency.
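The hybrid decoding idea in finding 03 can be illustrated with a toy sketch: a short prefix is produced one token at a time, and the remaining positions are filled by a fixed number of parallel passes. This is a minimal sketch under stated assumptions, not the authors' implementation. The functions `ar_step`, `parallel_pass`, and `hybrid_decode`, the random stand-in "models", and all sizes are hypothetical and chosen only to show the control flow and the count of model calls.

```python
import random


def ar_step(prefix, vocab_size, rng):
    """Hypothetical stand-in for one autoregressive model call."""
    return rng.randrange(vocab_size)


def parallel_pass(tokens, start, vocab_size, rng):
    """Hypothetical stand-in for one parallel pass over positions >= start."""
    return tokens[:start] + [rng.randrange(vocab_size) for _ in tokens[start:]]


def hybrid_decode(seq_len, prefix_len, nar_steps, vocab_size, rng):
    """Decode a short prefix token by token, then fill the rest in parallel."""
    prefix_len = min(prefix_len, seq_len)
    tokens = []
    for _ in range(prefix_len):
        # One model call per prefix token (autoregressive part).
        tokens.append(ar_step(tokens, vocab_size, rng))
    tokens += [None] * (seq_len - prefix_len)
    for _ in range(nar_steps):
        # One model call per pass, regardless of how many tokens remain.
        tokens = parallel_pass(tokens, prefix_len, vocab_size, rng)
    return tokens, prefix_len + nar_steps


tokens, calls = hybrid_decode(seq_len=500, prefix_len=50, nar_steps=20,
                              vocab_size=2048, rng=random.Random(0))
print(len(tokens), calls)  # prints: 500 70
```

In this toy setting the hybrid scheme needs 70 model calls, where fully autoregressive decoding of the same sequence would need 500. The numbers are illustrative only and are not measurements from the paper.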

Abstract

We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT consists of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which are then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is decoded in parallel. We demonstrate the…
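The training and inference procedure described in the abstract can be sketched in a few lines. The sketch below is a simplification under several assumptions: it uses a single token stream, commits individual tokens rather than spans, uses a cosine masking schedule, and replaces the transformer with a random `toy_model`. The `rescore` hook is a hypothetical callable standing in for an external scorer. None of these names or choices are confirmed by the text above.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a masked token position


def span_mask(seq_len, mask_rate, span_len, rng):
    """Mark masked positions using contiguous spans instead of single tokens.

    Spans may overlap, so the realised masking rate can fall below mask_rate.
    """
    n_spans = max(1, round(mask_rate * seq_len / span_len))
    masked = [False] * seq_len
    for start in rng.sample(range(seq_len), min(n_spans, seq_len)):
        for i in range(start, min(start + span_len, seq_len)):
            masked[i] = True
    return masked


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the transformer: a (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(seq_len, steps, vocab_size, rng, rescore=None):
    """Start fully masked and commit the most confident guesses at each step."""
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        # Cosine schedule: number of positions left masked after this step.
        frac = math.cos(math.pi * step / (2 * steps))
        n_masked_after = int(frac * seq_len) if step < steps else 0
        preds = toy_model(tokens, vocab_size, rng)
        masked_pos = [i for i, t in enumerate(tokens) if t == MASK]
        if rescore is None:
            scores = {i: preds[i][1] for i in masked_pos}
        else:
            # Optional external rescoring of each candidate (hypothetical hook).
            scores = {i: rescore(preds[i][0], preds[i][1]) for i in masked_pos}
        n_commit = len(masked_pos) - n_masked_after
        ranked = sorted(masked_pos, key=lambda i: scores[i], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


print(span_mask(seq_len=20, mask_rate=0.5, span_len=3, rng=random.Random(0)))
print(iterative_decode(seq_len=16, steps=4, vocab_size=100, rng=random.Random(1)))
```

Each decoding step commits only the highest-scoring candidates and leaves the rest masked for later steps, so the whole sequence is produced in `steps` model calls rather than one call per token.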

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The proposed masking strategy, which uses token spans instead of individual tokens, is an important methodological contribution of the paper since it greatly impacts the performance of the proposed model.
- The 10x speedup in latency is remarkable, especially in a setting like music, where it is required to sample many tokens, given the long context.

Weaknesses

- A problem with the work is related to the title and the overall tone, in which the authors claim to propose a new type of audio/music model. A paper titled "Masked Audio Generative Modeling" suggests that the authors are the first to propose a masked model for audio. Nonetheless, the papers introducing such an idea in audio are SoundStorm (which the authors cite correctly) and VampNet https://arxiv.org/abs/2307.04686 (which the authors should cite as concurrent work). The authors c

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 5

Strengths

The choice of spanned mask prediction training with restricted context for RVQ is well justified both intuitively and through ablation studies. The use of a shifted impulse function for analyzing EnCodec's latent vector is a welcomed addition, especially to better understand the dependency between multi-level RVQ tokens, which will inspire the community to improve the methods to train the codec model and/or the generative model alike. The accuracy-latency tradeoff analysis for the first-level R

Weaknesses

The ablation study is satisfactory overall, but I would also like to see quantitative analysis on the proposed annealed classifier-free guidance scale, which seems to be missing in the current manuscript. While intuitive, documenting the performance gain obtained by the method would make the paper more convincing and complete. Although the manuscript states that the training dataset is the same for several baseline models (Mousai and MusicGen), others (MusicLM and AudioLDM2) are trained on diff

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

The authors of this research paper address the challenging problem of generating long audio and music sequences, a task that has garnered significant attention due to its relevance in various applications, such as text-to-music and text-to-audio generation. Their proposed solution, known as MAGNET (Masked Audio Generation using Non-autoregressive Transformer), introduces several innovative techniques to tackle this problem effectively. One key feature of MAGNET is its use of a training via mask

Weaknesses

The main weakness of this paper is that there is a heavy amount of engineering involved in the development of this model. While this is definitely commendable, it makes reproducing the result extremely difficult for the rest of the research community. Further, the authors have not referred to the work titled "Masked Autoencoders that Listen", which has a very similar idea of span masking. It would have been interesting to see the contrast and comparison against something that is designed with similar id

Code & Models

Videos

Masked Audio Generation using a Single Non-Autoregressive Transformer · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing