It's Raw! Audio Generation with State-Space Models

Karan Goel; Albert Gu; Chris Donahue; Christopher Ré

arXiv:2202.09729·cs.SD·February 22, 2022·21 cites

Open Access

TL;DR

SaShiMi is a multi-scale architecture for raw waveform modeling built on S4. It achieves state-of-the-art results in unconditional autoregressive audio generation and outperforms prior models such as WaveNet in both sample quality and efficiency.

Contribution

The paper presents SaShiMi, a multi-scale architecture for raw audio modeling built around the S4 model. It also identifies that S4 can be unstable during autoregressive generation and stabilizes it with a modified parameterization motivated by Hurwitz matrices.

Findings

01. SaShiMi achieves 2× better mean opinion scores than WaveNet on speech.

02. SaShiMi outperforms WaveNet in density estimation and speed while using fewer parameters.

03. SaShiMi improves performance in both autoregressive and non-autoregressive audio generation.

Abstract

Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone…
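
The stability point in the abstract can be illustrated with a small sketch. A matrix is Hurwitz when every eigenvalue has a strictly negative real part. For a diagonal state matrix, that condition is exactly what keeps each discretized mode inside the unit circle under the bilinear transform, so the recurrence used at generation time cannot blow up. The function names, the step size `dt`, and the example eigenvalues below are illustrative assumptions. This is not SaShiMi's actual parameterization, which the abstract does not spell out.

```python
# Minimal sketch (assumption: a diagonal continuous-time state matrix, so each
# eigenvalue evolves independently). Not the paper's parameterization.

def bilinear_discretize(a: complex, dt: float) -> complex:
    """Bilinear (Tustin) discretization of one continuous-time mode `a`."""
    return (1 + dt * a / 2) / (1 - dt * a / 2)


def is_hurwitz_diagonal(eigenvalues) -> bool:
    """A diagonal matrix is Hurwitz iff every eigenvalue has negative real part."""
    return all(ev.real < 0 for ev in eigenvalues)


def simulate_mode(a: complex, dt: float, steps: int) -> float:
    """Run the recurrence x_{k+1} = a_bar * x_k from x_0 = 1 and return |x_steps|."""
    a_bar = bilinear_discretize(a, dt)
    x = 1 + 0j
    for _ in range(steps):
        x = a_bar * x
    return abs(x)


# Hypothetical eigenvalues chosen only for illustration.
stable_modes = [-0.5 + 3j, -0.1 - 10j]    # all real parts negative: Hurwitz
unstable_modes = [0.05 + 3j, -0.1 - 10j]  # one positive real part: not Hurwitz

if __name__ == "__main__":
    dt = 0.01
    for name, modes in [("stable", stable_modes), ("unstable", unstable_modes)]:
        print(name, "Hurwitz:", is_hurwitz_diagonal(modes))
        for a in modes:
            # |a_bar| < 1 means the mode decays step by step; > 1 means it grows.
            print("  mode", a, "|a_bar| =", abs(bilinear_discretize(a, dt)),
                  "|x| after 10000 steps =", simulate_mode(a, dt, 10000))
```

In this toy setting, a single eigenvalue with a small positive real part is enough for the state magnitude to grow without bound over a long rollout. That is the kind of failure that matters when a model is unrolled for many thousands of audio samples.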

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Diffusion · Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet