Improving Musical Accompaniment Co-creation via Diffusion Transformers
Javier Nistal, Marco Pasini, and Stefan Lattner

TL;DR
This paper enhances Diff-A-Riff, a latent diffusion model for musical accompaniment generation, improving audio fidelity, diversity, inference speed, and text control through architectural upgrades and new training strategies.
Contribution
It introduces a stereo-capable autoencoder, a Diffusion Transformer that replaces the latent U-Net, and a cross-modality network that translates text-derived CLAP embeddings into audio-derived ones, together with consistency-based training for faster inference.
Findings
Improved audio fidelity and diversity in generated music.
Faster inference: competitive quality with fewer denoising steps.
Enhanced text-driven control via cross-modality translation.
Abstract
Building upon Diff-A-Riff, a latent diffusion model for musical instrument accompaniment generation, we present a series of improvements targeting quality, diversity, inference speed, and text-driven control. First, we upgrade the underlying autoencoder to a stereo-capable model with superior fidelity and replace the latent U-Net with a Diffusion Transformer. Additionally, we refine text prompting by training a cross-modality predictive network to translate text-derived CLAP embeddings to audio-derived CLAP embeddings. Finally, we improve inference speed by training the latent model using a consistency framework, achieving competitive quality with fewer denoising steps. Our model is evaluated against the original Diff-A-Riff variant using objective metrics in ablation experiments, demonstrating promising advancements in all targeted areas. Sound examples are available at:…
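The text-to-audio embedding translation mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' network: it assumes small synthetic vectors in place of real CLAP embeddings, an affine map trained with plain SGD in place of the paper's predictive network, and a fixed offset standing in for the gap between text and audio embeddings. All names (`make_toy_pairs`, `train_translator`, `mean_cosine`) are hypothetical.

```python
import math
import random

DIM = 4  # toy embedding size (assumption); real CLAP embeddings are much larger


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_toy_pairs(n, rng):
    """Hypothetical paired data: audio = text + fixed offset + small noise."""
    offset = [0.8, -0.6, 0.5, -0.7]  # assumed modality gap, for illustration only
    pairs = []
    for _ in range(n):
        text = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        audio = [t + o + rng.gauss(0.0, 0.05) for t, o in zip(text, offset)]
        pairs.append((text, audio))
    return pairs


def translate(weights, bias, text):
    """Apply the affine map: text embedding -> predicted audio embedding."""
    return [sum(w * x for w, x in zip(row, text)) + b
            for row, b in zip(weights, bias)]


def train_translator(pairs, lr=0.02, epochs=50):
    """Fit the affine map with per-sample SGD on squared error."""
    weights = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
    bias = [0.0] * DIM
    for _ in range(epochs):
        for text, audio in pairs:
            pred = translate(weights, bias, text)
            err = [p - a for p, a in zip(pred, audio)]
            for i in range(DIM):
                for j in range(DIM):
                    weights[i][j] -= lr * err[i] * text[j]
                bias[i] -= lr * err[i]
    return weights, bias


def mean_cosine(pairs, weights=None, bias=None):
    """Average cosine to the audio embedding, with or without translation."""
    total = 0.0
    for text, audio in pairs:
        source = text if weights is None else translate(weights, bias, text)
        total += cosine(source, audio)
    return total / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    train, held_out = make_toy_pairs(64, rng), make_toy_pairs(32, rng)
    weights, bias = train_translator(train)
    print("raw text vs audio:       ", round(mean_cosine(held_out), 3))
    print("translated text vs audio:", round(mean_cosine(held_out, weights, bias), 3))
```

On this toy data the translated text embeddings should sit closer to the paired audio embeddings than the raw text embeddings do, which is the intuition behind conditioning the generator on a translated embedding rather than on the text embedding directly.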
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Tactile and Sensory Interactions
