SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

TL;DR
SoundCTM introduces a unified framework that combines fast 1-step and high-quality multi-step deterministic sound generation, enabling efficient and flexible sound content creation for multimedia applications.
Contribution
It reframes the CTM training framework for sound generation and scales it to 1B parameters, achieving both fast and high-quality deterministic full-band sound synthesis.
Findings
Achieves high-quality 1-step sound generation.
Enables deterministic multi-step sound refinement.
Scales up to 1B parameters for production-level quality.
Abstract
Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these models often suffer from slow inference speeds, imposing an undesirable burden that hinders the trial-and-error process. While existing T2S distillation models address this limitation through 1-step generation, the sample quality of 1-step generation remains insufficient for production use. Additionally, while multi-step sampling in those distillation models improves sample quality itself, the semantic content changes due to their lack of deterministic sampling capabilities. To address these…
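The contrast the abstract draws between deterministic and non-deterministic multi-step sampling can be illustrated with a toy sketch. This is not SoundCTM's model or code. It assumes one-dimensional Gaussian data with standard deviation `SIGMA_DATA`, for which the probability-flow ODE has a closed-form solution, and it uses that exact solution (`jump`) as a stand-in for a learned network that maps a sample at time `t` to time `s`. The function names, the time schedule, and the "restart" sampler (denoise to time 0, then add fresh noise, as in the multistep sampler of consistency models) are illustrative assumptions about the kind of behavior the abstract refers to.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed standard deviation of the toy data distribution


def jump(x_t, t, s):
    """Exact probability-flow ODE jump from time t to time s for toy
    Gaussian data (stand-in for a learned anytime-to-anytime network)."""
    return x_t * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))


def deterministic_multistep(x_T, times):
    """Chain jumps t_0 -> t_1 -> ... -> 0 with no noise injection.
    The output depends only on the initial noise x_T."""
    x = x_T
    for t, s in zip(times[:-1], times[1:]):
        x = jump(x, t, s)
    return x


def restart_multistep(x_T, times, rng):
    """Denoise fully to time 0, then re-noise to the next time with fresh
    Gaussian noise. The output depends on x_T and on the extra noise."""
    x = x_T
    for t, s in zip(times[:-1], times[1:]):
        x0 = jump(x, t, 0.0)
        x = x0 + s * rng.gauss(0.0, 1.0)  # s == 0 on the last step: no noise
    return x


times = [80.0, 10.0, 1.0, 0.0]  # hypothetical decreasing time schedule
x_T = 80.0 * random.Random(42).gauss(0.0, 1.0)  # fixed initial noise

one_step = jump(x_T, times[0], 0.0)
multi_step = deterministic_multistep(x_T, times)
restart_a = restart_multistep(x_T, times, random.Random(0))
restart_b = restart_multistep(x_T, times, random.Random(1))
```

In this toy setting, `multi_step` matches `one_step` up to floating-point error because the exact jumps compose along a single ODE trajectory, while `restart_a` and `restart_b` differ even though they start from the same `x_T`. With a learned model the 1-step and multi-step outputs would not coincide exactly; the sketch only shows why removing noise injection keeps the result tied to the initial noise.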
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Natural Language Processing Techniques
Methods: ALIGN
