SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

Koichi Saito; Dongjun Kim; Takashi Shibuya; Chieh-Hsin Lai; Zhi Zhong; Yuhta Takida; Yuki Mitsufuji

arXiv:2405.18503·cs.SD·March 11, 2025

TL;DR

SoundCTM introduces a unified framework that combines fast 1-step and high-quality multi-step deterministic sound generation, enabling efficient and flexible sound content creation for multimedia applications.

Contribution

It reframes the CTM training framework for sound generation and scales it to 1B parameters, achieving both fast and high-quality deterministic full-band sound synthesis.

Findings

01. Achieves high-quality 1-step sound generation.

02. Enables deterministic multi-step sound refinement.

03. Scales up to 1B parameters for production-level quality.
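
To illustrate what "deterministic multi-step" means in the findings above, here is a minimal toy sketch, not the SoundCTM network or its sampler. It assumes a 1-D Gaussian data distribution, for which the probability-flow ODE under a variance-exploding noise schedule can be solved in closed form, so a jump function from noise level t to s can be written exactly. All names (`jump`, `sample_deterministic`, `sample_stochastic`, `MU`, `SIGMA_DATA`) are hypothetical and chosen only for this illustration.

```python
import math
import random

# Toy 1-D Gaussian "data" distribution N(MU, SIGMA_DATA^2). This is an
# assumption for illustration only; a real model learns the jump with a network.
MU, SIGMA_DATA = 0.5, 1.0


def jump(x, t, s):
    """Exact probability-flow ODE jump from noise level t to s (s <= t)."""
    scale = math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))
    return MU + scale * (x - MU)


def sample_deterministic(x_T, times):
    """Chain jumps t0 -> t1 -> ... -> 0 with no fresh noise injected."""
    x = x_T
    for t, s in zip(times[:-1], times[1:]):
        x = jump(x, t, s)
    return x


def sample_stochastic(x_T, times, rng):
    """Jump to 0, re-noise to the next level, and repeat (adds new randomness)."""
    x = jump(x_T, times[0], 0.0)
    for t in times[1:-1]:
        x_noisy = x + t * rng.gauss(0.0, 1.0)  # fresh noise changes the path
        x = jump(x_noisy, t, 0.0)
    return x


if __name__ == "__main__":
    times = [80.0, 10.0, 1.0, 0.0]
    x_T = MU + 80.0 * random.Random(42).gauss(0.0, 1.0)
    print("1-step:          ", jump(x_T, times[0], 0.0))
    print("multi-step (det):", sample_deterministic(x_T, times))
    print("multi-step (sto):", sample_stochastic(x_T, times, random.Random(0)))
```

In this toy setting the deterministic multi-step result matches the 1-step result for the same initial noise, because exact jumps along one ODE trajectory compose. The stochastic variant lands on a different sample for each noise seed, which is the kind of content drift the abstract attributes to samplers without deterministic sampling. With a learned, imperfect jump function, extra deterministic steps would instead refine the estimate along the same trajectory.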

Abstract

Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these models often suffer from slow inference speeds, imposing an undesirable burden that hinders the trial-and-error process. While existing T2S distillation models address this limitation through 1-step generation, the sample quality of 1-step generation remains insufficient for production use. Additionally, while multi-step sampling in those distillation models improves sample quality itself, the semantic content changes due to their lack of deterministic sampling capabilities. To address these…

Code & Models

Repositories

sony/soundctm
PyTorch · Official

Models

🤗 Sony/soundctm · model · ♡ 18

Videos

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Natural Language Processing Techniques

Methods: ALIGN