DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Zachary Novack; Julian McAuley; Taylor Berg-Kirkpatrick; Nicholas J. Bryan

arXiv:2405.20289 · cs.SD · May 31, 2024

TL;DR

DITTO-2 speeds up inference-time optimization for diffusion-based music generation by distilling the diffusion model and optimizing with one-step sampling, enabling faster-than-real-time controllable generation with better quality and closer adherence to control signals.

Contribution

The paper introduces DITTO-2, a distillation-based approach that speeds up diffusion inference for music generation and enhances control and quality, including a new application for text adherence.

Findings

01. Speeds up music generation 10-20x over previous methods.

02. Improves control adherence and generation quality simultaneously.

03. Enables state-of-the-art text control in diffusion models.

Abstract

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, and (2) performing inference-time optimization using our distilled model with one-step…
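
As a rough illustration of the second step named in the abstract (inference-time optimization of the initial noise latent through a one-step sampler), the following is a minimal toy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: `one_step_sample` is a hypothetical stand-in for a distilled model (here just a tanh of a linear map), `F` is a made-up linear feature extractor, `target` is an arbitrary control signal, and plain gradient descent with a hand-derived gradient replaces whatever optimizer and automatic differentiation a real system would use.

```python
import numpy as np

def one_step_sample(x_T, W):
    # Hypothetical stand-in for a distilled one-step sampler:
    # maps an initial noise latent x_T directly to an output x_0.
    return np.tanh(W @ x_T)

def optimize_noise_latent(W, F, target, steps=500, lr=1.0, seed=0):
    # Gradient descent on x_T so that the features of the one-step
    # output match the target control signal (squared-error loss).
    rng = np.random.default_rng(seed)
    x_T = rng.standard_normal(W.shape[1])
    losses = []
    for _ in range(steps):
        x_0 = one_step_sample(x_T, W)
        residual = F @ x_0 - target
        losses.append(float(residual @ residual))
        # Chain rule through the feature map and the tanh sampler.
        grad_x0 = 2.0 * (F.T @ residual)
        grad_pre = (1.0 - x_0 ** 2) * grad_x0
        x_T = x_T - lr * (W.T @ grad_pre)
    return x_T, losses

# Toy setup (all values are illustrative assumptions).
W = 0.5 * np.eye(8)                            # toy "model" weights
F = np.kron(np.eye(4), np.full((1, 2), 0.5))   # averages adjacent output pairs
target = np.array([0.2, -0.1, 0.3, 0.0])       # toy control signal

x_T_opt, losses = optimize_noise_latent(W, F, target)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.2e}")
```

The point of the sketch is only that each optimization step needs a single forward and backward pass through the sampler; with a multi-step sampler, every step would have to differentiate through the whole sampling chain, which is the cost that distillation is meant to reduce.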

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Neuroscience and Music Perception

Methods: Diffusion