Improving Musical Accompaniment Co-creation via Diffusion Transformers

Javier Nistal, Marco Pasini, and Stefan Lattner

arXiv:2410.23005 · cs.SD · October 31, 2024

Open Access

TL;DR

This paper enhances a diffusion-based model for musical accompaniment by improving audio fidelity, diversity, inference speed, and text control through architectural upgrades and training strategies.

Contribution

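It introduces a stereo-capable autoencoder, a Diffusion Transformer that replaces the latent U-Net, and a cross-modality predictive network that translates text-derived CLAP embeddings into audio-derived ones, improving the quality and controllability of musical accompaniment generation.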
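
As a rough illustration of the cross-modality translation idea, the sketch below fits a map from "text" embeddings to "audio" embeddings on paired toy vectors. Everything here is an assumption made for illustration: the vectors are random stand-ins rather than real CLAP embeddings, the constant `OFFSET` is a hypothetical modality gap, and a linear map trained by stochastic gradient descent replaces the predictive network described in the paper, whose architecture is not given on this page.

```python
import random

DIM = 4
OFFSET = [0.5, -0.3, 0.2, 0.1]  # hypothetical gap between text and audio embeddings


def make_pairs(n, rng):
    """Build toy (text, audio) pairs where audio = text + OFFSET."""
    pairs = []
    for _ in range(n):
        text = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        audio = [t + o for t, o in zip(text, OFFSET)]
        pairs.append((text, audio))
    return pairs


def translate(W, b, text_emb):
    """Map a text-derived embedding to a predicted audio-derived embedding."""
    return [sum(w * t for w, t in zip(row, text_emb)) + bi for row, bi in zip(W, b)]


def mse(W, b, pairs):
    """Mean squared error between predicted and target audio embeddings."""
    total = 0.0
    for text, audio in pairs:
        pred = translate(W, b, text)
        total += sum((p - a) ** 2 for p, a in zip(pred, audio)) / DIM
    return total / len(pairs)


def train(pairs, lr=0.05, epochs=300):
    """Fit W and b with per-sample gradient steps on the squared error."""
    W = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
    b = [0.0] * DIM
    for _ in range(epochs):
        for text, audio in pairs:
            pred = translate(W, b, text)
            err = [p - a for p, a in zip(pred, audio)]
            for i in range(DIM):
                for j in range(DIM):
                    W[i][j] -= lr * err[i] * text[j]
                b[i] -= lr * err[i]
    return W, b


pairs = make_pairs(16, random.Random(0))
W, b = train(pairs)
```

At inference time, a model conditioned on audio-derived embeddings would receive `translate(W, b, text_emb)` instead of the raw text embedding; the design choice being illustrated is that the translation is learned once on paired data and then applied to unseen prompts.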

Findings

01. Improved audio fidelity and diversity in generated music.

02. Faster inference: competitive quality with fewer denoising steps.

03. Enhanced text-driven control via cross-modality translation.

Abstract

Building upon Diff-A-Riff, a latent diffusion model for musical instrument accompaniment generation, we present a series of improvements targeting quality, diversity, inference speed, and text-driven control. First, we upgrade the underlying autoencoder to a stereo-capable model with superior fidelity and replace the latent U-Net with a Diffusion Transformer. Additionally, we refine text prompting by training a cross-modality predictive network to translate text-derived CLAP embeddings to audio-derived CLAP embeddings. Finally, we improve inference speed by training the latent model using a consistency framework, achieving competitive quality with fewer denoising steps. Our model is evaluated against the original Diff-A-Riff variant using objective metrics in ablation experiments, demonstrating promising advancements in all targeted areas. Sound examples are available at:…
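
To make the "fewer denoising steps" idea concrete, here is a minimal sketch of multistep consistency sampling on a one-dimensional toy problem. It is not the paper's model: the data distribution is assumed to be a Gaussian with mean `MU` and standard deviation `SIGMA_DATA`, for which the consistency function has a closed form, whereas in practice that function would be a trained network over audio latents. The noise levels in `sigmas` are likewise illustrative.

```python
import math
import random

MU = 2.0          # mean of the toy data distribution (assumption)
SIGMA_DATA = 0.5  # standard deviation of the toy data distribution (assumption)


def consistency_fn(x, sigma):
    """Map a noisy sample at noise level sigma directly to a clean estimate.

    Closed form for Gaussian data under x_sigma = x_0 + sigma * noise;
    it reduces to the identity at sigma = 0.
    """
    scale = SIGMA_DATA / math.sqrt(SIGMA_DATA ** 2 + sigma ** 2)
    return MU + (x - MU) * scale


def sample(sigmas, rng):
    """Multistep consistency sampling over decreasing noise levels.

    Step 1: draw pure noise at the largest level and denoise in one jump.
    Later steps: re-noise the estimate to a lower level and denoise again.
    """
    x = sigmas[0] * rng.gauss(0.0, 1.0)
    x0 = consistency_fn(x, sigmas[0])
    for sigma in sigmas[1:]:
        x = x0 + sigma * rng.gauss(0.0, 1.0)
        x0 = consistency_fn(x, sigma)
    return x0


rng = random.Random(0)
samples = [sample([80.0, 1.0], rng) for _ in range(2000)]
```

Here each sample costs two function evaluations, one per entry in `sigmas`; passing a single-element list gives one-step generation, and adding intermediate levels trades speed for extra refinement.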

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Tactile and Sensory Interactions