SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

TL;DR
SpecMaskGIT is a lightweight, efficient masked generative model for audio spectrograms that synthesizes 10-second clips in fewer than 16 iterations, outperforming larger models in quality and speed, and enabling diverse applications.
Contribution
We introduce SpecMaskGIT, a novel masked generative model for spectrograms that significantly reduces inference iterations and computational cost while maintaining high-quality audio synthesis.
Findings
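Synthesizes 10s audio clips in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods.
Outperforms larger VQ-Diffusion and auto-regressive models in the TTA benchmark.
Achieves real-time synthesis with only 4 CPU cores and is 30x faster on GPU.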
Abstract
Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to the hundreds of iterations required in the inference phase and the large number of model parameters. To address these challenges, we propose SpecMaskGIT, a lightweight, efficient yet effective TTA model based on the masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10s audio clip in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models in the TTA benchmark, while being real-time with only 4 CPU cores or even 30x…
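The abstract does not spell out how a clip is produced in so few iterations. As a rough illustration only, the sketch below shows a generic MaskGIT-style parallel decoding loop: every token starts masked, and at each step the most confident predictions are committed while the rest stay masked for the next pass. The cosine schedule, the 256-token sequence length, the 1024-entry codebook, and the `toy_predict` stand-in (which returns random tokens and confidences in place of a trained model) are all assumptions made for this example, not details confirmed by this summary.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked token position


def masked_count(num_tokens, step, num_iters):
    """Tokens left masked after `step` of `num_iters` (assumed cosine schedule)."""
    ratio = math.cos(math.pi / 2 * step / num_iters)
    return math.floor(num_tokens * ratio)


def toy_predict(tokens, rng):
    """Stand-in for a trained model: one (token_id, confidence) pair per position."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens]


def iterative_decode(num_tokens=256, num_iters=16, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # start from a fully masked sequence
    for step in range(1, num_iters + 1):
        preds = toy_predict(tokens, rng)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        keep_masked = masked_count(num_tokens, step, num_iters)
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens


tokens = iterative_decode()
```

Because the schedule reaches zero masked tokens at the final step, the loop fills the whole sequence in exactly `num_iters` passes, regardless of sequence length. This is why the iteration count can stay small compared with token-by-token decoding.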
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
