MMEdit: A Unified Framework for Multi-Type Audio Editing via Audio Language Model
Ye Tao, Wen Wu, Chao Zhang, Mengyue Wu, Shuai Wang, Xuenan Xu

TL;DR
MMEdit introduces a comprehensive, scalable framework for multi-type audio editing that leverages audio-language models to improve localization, instruction adherence, and fidelity across diverse editing operations.
Contribution
MMEdit extends audio editing task definitions to cover a wider range of operations, builds a large-scale paired dataset, and uses an audio-language model to align editing instructions with the acoustic context.
Findings
Achieves superior localization accuracy in editing tasks
Demonstrates robust adherence to editing instructions
Maintains high fidelity in non-edited audio regions
Abstract
Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, while training-based methods, although achieving higher generation quality, are severely constrained by the scarcity of high-quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts. To address these challenges, we propose MMEdit, an audio-language-model-driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition,…
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis
