SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Xinlei Niu; Jing Zhang; Christian Walder; Charles Patrick Martin

arXiv:2405.15338·cs.SD·May 27, 2024

Open Access

TL;DR

SoundLoCD is an efficient text-to-sound generation model that uses a conditional discrete contrastive latent diffusion approach, achieving high-quality results with limited computational resources and improved text-output coherence.

Contribution

The paper introduces a resource-efficient, LoRA-based diffusion model with contrastive learning for text-to-sound generation that outperforms the baseline.

Findings

01. Outperforms the baseline in quality and efficiency

02. Requires significantly fewer computational resources

03. Contrastive learning strengthens text-sound coherence

Abstract

We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: https://XinleiNIU.github.io/demo-SoundLoCD/
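The abstract credits part of the training efficiency to LoRA. As a rough illustration of the general low-rank adaptation idea only (not the paper's implementation; the abstract does not say which layers are adapted or what rank is used), the sketch below freezes a weight matrix W and trains two small matrices A and B instead. All names and sizes here (`lora_forward`, `d_in`, `d_out`, `rank`, `scale`) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale):
    """Compute x @ W + scale * (x @ A) @ B; W stays frozen, only A and B would be trained."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def count(m):
    """Number of entries in a matrix."""
    return len(m) * len(m[0])


# Hypothetical layer sizes, chosen only for illustration.
d_in, d_out, rank = 64, 64, 4
rng = random.Random(0)

w_frozen = [[rng.gauss(0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[rng.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op

x = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(2)]
y = lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0)

full_params = count(w_frozen)                # entries updated by full fine-tuning
lora_params = count(lora_a) + count(lora_b)  # entries updated with the adapter
print(full_params, lora_params)
```

With these assumed sizes the adapter trains 512 values in place of 4096, which is the kind of saving that makes training feasible on limited hardware; the actual ratio in SoundLoCD depends on details not given on this page.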

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis