It's Raw! Audio Generation with State-Space Models
Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

TL;DR
SaShiMi introduces a novel multi-scale waveform modeling architecture based on S4, achieving state-of-the-art results in unconditional audio generation, outperforming prior models like WaveNet in quality and efficiency.
Contribution
The paper presents SaShiMi, a new architecture for raw audio modeling that improves stability, performance, and efficiency over existing models by leveraging the S4 model and novel parameterization techniques.
Findings
SaShiMi achieves more than 2x better mean opinion scores than WaveNet on unconditional speech generation.
SaShiMi outperforms WaveNet in density estimation and speed with fewer parameters.
SaShiMi improves both autoregressive and non-autoregressive audio generation tasks.
Abstract
Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone…
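The Hurwitz connection mentioned in the abstract can be illustrated with a small numerical check. A matrix is Hurwitz when every eigenvalue has a negative real part, which keeps the linear recurrence of a state-space model from blowing up as it is unrolled step by step during autoregressive generation. The sketch below is not the authors' code. It assumes, for illustration only, a state matrix of diagonal-plus-rank-one form, and compares an unconstrained version Λ + pq* with a tied version Λ − pp*. These two forms, the helper names `eig2`, `diag_plus_rank1`, and `is_hurwitz`, and the 2x2 example values are all assumptions that do not appear in the abstract.

```python
import cmath

def eig2(a, b, c, d):
    """Eigenvalues of the 2x2 complex matrix [[a, b], [c, d]] via the quadratic formula."""
    tr = a + d
    det = a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def diag_plus_rank1(lam, p, q, sign):
    """Row-major entries (a, b, c, d) of diag(lam) + sign * p q^* for 2-vectors p, q."""
    return tuple(
        (lam[i] if i == j else 0) + sign * p[i] * q[j].conjugate()
        for i in range(2)
        for j in range(2)
    )

def is_hurwitz(entries):
    """True when both eigenvalues have strictly negative real part."""
    return all(ev.real < 0 for ev in eig2(*entries))

# Hypothetical example values: a stable diagonal part and a rank-one correction.
lam = (-0.5 + 1j, -0.5 - 1j)
p = (1.0 + 0j, 1.0 + 0j)
q = (1.0 + 0j, 1.0 + 0j)

unconstrained = diag_plus_rank1(lam, p, q, +1)  # Lambda + p q^*: no stability guarantee
tied = diag_plus_rank1(lam, p, p, -1)           # Lambda - p p^*: negative semidefinite correction

print("Lambda + p q^* Hurwitz:", is_hurwitz(unconstrained))
print("Lambda - p p^* Hurwitz:", is_hurwitz(tied))
```

With these values the unconstrained matrix has a double eigenvalue at 0.5, so it is not Hurwitz, while the tied matrix has a double eigenvalue at -1.5. The tied form is Hurwitz whenever the diagonal entries have negative real parts, because subtracting pp* can only push the real part of x*Ax further below zero for any nonzero vector x.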
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Diffusion · Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
