SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Michael Ungersböck; Florian Grötschla; Luca A. Lanzendörfer; June Young Yi; Changho Choi; Roger Wattenhofer

arXiv:2510.22795 · cs.SD · October 28, 2025

TL;DR

SAO-Instruct is a model based on Stable Audio Open that edits audio clips from free-form natural language instructions. It is trained on a new dataset of audio editing triplets, outperforms existing methods in subjective listening tests, and achieves competitive objective metrics.

Contribution

The paper introduces SAO-Instruct, the first model capable of free-form natural language audio editing, along with a new dataset and training pipeline for this task.

Findings

01 SAO-Instruct outperforms existing methods in subjective listening tests.

02 The model generalizes well to real-world audio and unseen instructions.

03 It achieves competitive objective metrics on audio editing benchmarks.

Abstract

Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive…
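The triplet format mentioned in the abstract (input audio, edit instruction, output audio) can be pictured as a small record type. The following is only an illustrative sketch: the names `EditTriplet`, `make_triplet`, `count_by_source`, and the `source` labels are hypothetical and not taken from the paper or its code, and waveforms are represented as plain lists of floats rather than in whatever format the authors actually use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical labels for the three data sources named in the abstract.
SOURCES = ("prompt_to_prompt", "ddpm_inversion", "manual")


@dataclass(frozen=True)
class EditTriplet:
    """One training example: audio before the edit, the instruction, audio after."""
    input_audio: List[float]   # waveform samples before the edit (assumed mono)
    instruction: str           # free-form natural language edit instruction
    output_audio: List[float]  # waveform samples after the edit
    source: str                # which pipeline produced this triplet


def make_triplet(input_audio: Iterable[float], instruction: str,
                 output_audio: Iterable[float], source: str) -> EditTriplet:
    """Validate the fields and build a triplet."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    cleaned = instruction.strip()
    if not cleaned:
        raise ValueError("instruction must not be empty")
    return EditTriplet(list(input_audio), cleaned, list(output_audio), source)


def count_by_source(triplets: Iterable[EditTriplet]) -> Dict[str, int]:
    """Count how many triplets each pipeline contributed."""
    counts = {name: 0 for name in SOURCES}
    for triplet in triplets:
        counts[triplet.source] += 1
    return counts


if __name__ == "__main__":
    # Toy example: a deterministic "manual" edit that halves the amplitude.
    original = [0.0, 0.5, -0.5, 1.0]
    quieter = [0.5 * sample for sample in original]
    example = make_triplet(original, "make it quieter", quieter, "manual")
    print(example.instruction, count_by_source([example]))
```

Tagging each triplet with its source is a design choice made here for illustration; it would make it straightforward to inspect how the synthetic and manually edited portions of such a dataset are balanced.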

Code & Models

Models

disco-eth/sao-instruct (Hugging Face model, 258 downloads, 4 likes)

Videos

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions (SlidesLive)