Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models

Teysir Baoueb; Xiaoyu Bie; Xi Wang; Gaël Richard

arXiv:2506.15530·cs.SD·June 19, 2025

Open Access

TL;DR

This paper introduces Diff-TONE, a method for instrument editing with pretrained text-to-music diffusion models. By selecting a suitable intermediate timestep in the diffusion process, it changes the instrument timbre while preserving the underlying content, and it requires no extra training.

Contribution

The paper proposes a timestep optimization technique for instrument editing in text-to-music diffusion models. It improves control over the edit without additional training and without slowing generation.

Findings

01 Selecting an intermediate timestep improves instrument editing quality.

02 The method preserves the original content while changing the instrument timbre.

03 No additional training is required, and generation speed is maintained.

Abstract

Breakthroughs in text-to-music generation models are transforming the creative landscape, equipping musicians with innovative tools for composition and experimentation like never before. However, controlling the generation process to achieve a specific desired outcome remains a significant challenge. Even a minor change in the text prompt, combined with the same random seed, can drastically alter the generated piece. In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Based on the insight that the model first focuses on the overall structure or content of the audio, then adds instrument information, and finally refines the quality, we show that selecting a…
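The coarse-to-fine intuition in the abstract (structure first, then instrument, then quality) can be illustrated with a toy sketch. The code below is not the authors' method: it assumes a standard linear-beta DDPM forward process and SDEdit-style partial noising as a stand-in, and every name (`alpha_bar`, `noise_to_timestep`, `correlation`) and schedule value is hypothetical. It only shows that the further a source signal is noised, the less of it survives, which suggests why the choice of intermediate timestep trades content preservation against freedom to change timbre.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal variance kept after t steps of a linear-beta schedule.

    The schedule values are common DDPM defaults, assumed here for illustration.
    """
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_to_timestep(x0, t, rng, num_steps=1000):
    """Forward-diffuse x0 to timestep t: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = alpha_bar(t, num_steps)
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x0
    ]


def correlation(u, v):
    """Pearson correlation, used as a crude proxy for 'content preserved'."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


if __name__ == "__main__":
    rng = random.Random(0)
    # A sine wave stands in for the source track (or its latent).
    source = [math.sin(2.0 * math.pi * 5.0 * i / 4000) for i in range(4000)]
    for t in (100, 500, 900):
        noised = noise_to_timestep(source, t, rng)
        print(f"t={t:4d}  alpha_bar={alpha_bar(t):.4f}  "
              f"corr(source, x_t)={correlation(source, noised):.3f}")
```

In an actual editing pipeline, the denoiser would then be run from the chosen timestep down to zero under the target-instrument prompt. That step is omitted here because it depends on the specific pretrained model.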

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion