MMEdit: A Unified Framework for Multi-Type Audio Editing via Audio Language Model

Ye Tao; Wen Wu; Chao Zhang; Mengyue Wu; Shuai Wang; Xuenan Xu

arXiv:2512.20339 · cs.SD · January 21, 2026

Open Access

TL;DR

MMEdit introduces a comprehensive, scalable framework for multi-type audio editing that leverages audio-language models to improve localization, instruction adherence, and fidelity across diverse editing operations.

Contribution

MMEdit extends the task definitions for audio editing, builds a large-scale paired dataset, and integrates an audio language model for precise, flexible audio modifications.

Findings

01 Achieves superior localization accuracy in editing tasks

02 Demonstrates robust adherence to editing instructions

03 Maintains high fidelity in non-edited audio regions

Abstract

Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, while training-based methods, although achieving higher generation quality, are severely constrained by the scarcity of high-quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts. To address these challenges, we propose MMEdit, an audio-language-model-driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition,…

Taxonomy

Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis