SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Marco Comunità; Zhi Zhong; Akira Takahashi; Shiqi Yang; Mengjie Zhao; Koichi Saito; Yukara Ikemiya; Takashi Shibuya; Shusuke Takahashi; Yuki Mitsufuji

arXiv:2406.17672 · cs.SD · June 27, 2024

TL;DR

SpecMaskGIT is a lightweight masked generative model for audio spectrograms that synthesizes a 10-second clip in fewer than 16 iterations, outperforms larger VQ-Diffusion and auto-regressive models on the text-to-audio benchmark, and supports applications beyond synthesis.

Contribution

We introduce SpecMaskGIT, a masked generative model for spectrograms that cuts inference to fewer than 16 iterations, an order of magnitude fewer than previous iterative text-to-audio methods, and lowers computational cost while maintaining high-quality audio synthesis.

Findings

01 Synthesizes 10 s audio clips in fewer than 16 iterations.

02 Outperforms larger models such as VQ-Diffusion in quality.

03 Achieves real-time synthesis on CPU and is 30x faster on GPU.

Abstract

Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient because of the hundreds of iterations required in the inference phase and the large number of model parameters. To address these challenges, we propose SpecMaskGIT, a lightweight, efficient yet effective TTA model based on masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10 s audio clip in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models on the TTA benchmark, while being real-time with only 4 CPU cores or even 30x…
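
For readers unfamiliar with masked generative modeling, the following is a minimal sketch of how a small, fixed number of iterations can fill in a whole token sequence. It assumes the confidence-based parallel decoding and cosine masking schedule commonly associated with MaskGIT-style models; the abstract does not state which schedule SpecMaskGIT uses. The names `toy_predictor`, `iterative_decode`, `MASK`, and all default sizes are hypothetical, and the predictor returns random values in place of a trained model.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model (assumption): one (token, confidence)
    guess per position. A real model would condition on text and on the
    tokens already committed."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens=256, vocab_size=1024, num_iters=16, seed=0):
    """Fill a fully masked sequence in `num_iters` parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    masked_history = []  # masked positions remaining after each iteration

    for t in range(1, num_iters + 1):
        preds = toy_predictor(tokens, vocab_size, rng)

        # Cosine schedule (assumed): how many positions stay masked after
        # iteration t. It reaches 0 at the final iteration.
        ratio = math.cos(math.pi / 2 * t / num_iters)
        keep_masked = max(0, math.floor(num_tokens * ratio))

        # Rank the currently masked positions by predicted confidence.
        masked_pos = [i for i, tok in enumerate(tokens) if tok == MASK]
        masked_pos.sort(key=lambda i: preds[i][1], reverse=True)

        # Commit the most confident guesses; the rest stay masked.
        n_commit = max(0, len(masked_pos) - keep_masked)
        for i in masked_pos[:n_commit]:
            tokens[i] = preds[i][0]

        masked_history.append(len(masked_pos) - n_commit)

    return tokens, masked_history


if __name__ == "__main__":
    tokens, history = iterative_decode()
    print("masked positions after each iteration:", history)
```

Because every iteration predicts all positions in parallel and commits many of them at once, the number of model calls equals the iteration count rather than the sequence length, which is the general reason this family of methods needs far fewer steps than token-by-token decoding.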

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies