Msanii: High Fidelity Music Synthesis on a Shoestring Budget
Kinyugo Maina

TL;DR
Msanii is a diffusion-based model that efficiently synthesizes long, high-fidelity stereo music at high sample rates. To the best of the author's knowledge, it is the first successful application of diffusion models to music samples of this length.
Contribution
Introduces Msanii, a diffusion-based music synthesis model that generates long, high-quality stereo music at high sample rates, which the author reports as a first for diffusion models.
Findings
Synthesizes 190 seconds of stereo music at 44.1 kHz
Does not rely on concatenative or cascading synthesis techniques
Achieves high-fidelity music synthesis with diffusion models
Abstract
In this paper, we present Msanii, a novel diffusion-based model for synthesizing long-context, high-fidelity music efficiently. Our model combines the expressiveness of mel spectrograms, the generative capabilities of diffusion models, and the vocoding capabilities of neural vocoders. We demonstrate the effectiveness of Msanii by synthesizing tens of seconds (190 seconds) of stereo music at high sample rates (44.1 kHz) without the use of concatenative synthesis, cascading architectures, or compression techniques. To the best of our knowledge, this is the first work to successfully employ a diffusion-based model for synthesizing such long music samples at high sample rates. Our demo can be found at https://kinyugo.github.io/msanii-demo and our code at https://github.com/Kinyugo/msanii .
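To give a sense of the scale behind the figures in the abstract, the arithmetic can be sketched as below. The duration, sample rate, and channel count come from the abstract. The hop length is an assumption chosen purely for illustration and is not taken from the paper, so the frame count shows only why working on mel spectrograms shortens the sequence a model has to handle.

```python
# Values stated in the abstract.
SAMPLE_RATE_HZ = 44_100
DURATION_S = 190
NUM_CHANNELS = 2  # stereo

# ASSUMPTION: illustrative STFT hop length, not a value from the paper.
HOP_LENGTH = 1024


def samples_per_channel(duration_s: int, sample_rate_hz: int) -> int:
    """Number of raw audio samples in one channel."""
    return duration_s * sample_rate_hz


def spectrogram_frames(num_samples: int, hop_length: int) -> int:
    """Frame count for a centered STFT: 1 + floor(num_samples / hop_length)."""
    return 1 + num_samples // hop_length


n_samples = samples_per_channel(DURATION_S, SAMPLE_RATE_HZ)
total_samples = n_samples * NUM_CHANNELS
n_frames = spectrogram_frames(n_samples, HOP_LENGTH)

print(f"samples per channel: {n_samples:,}")
print(f"samples across both channels: {total_samples:,}")
print(f"spectrogram frames at hop {HOP_LENGTH}: {n_frames:,}")
```

Under this assumed hop length, each channel's time axis shrinks from millions of waveform samples to a few thousand spectrogram frames, each of which carries a vector of mel bins.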
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Computer Graphics and Visualization Techniques
Methods: Diffusion
