EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

Vassilis Sioros; Alexandros Potamianos; Giorgos Paraskevopoulos

arXiv:2507.11096·cs.SD·July 16, 2025

TL;DR

This paper introduces EditGen, a method for instruction-based audio editing that applies cross-attention control in auto-regressive models. It establishes a diffusion-based baseline for prompt-guided editing and proposes an alternative built on MUSICGEN, a frozen pre-trained auto-regressive model, for editing musical audio.

Contribution

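We propose a cross-attention control approach for auto-regressive audio editing. A diffusion-based strategy influenced by Auffusion serves as a baseline for prompt-guided editing, and the frozen pre-trained MUSICGEN model supports three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores.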

Findings

1. Outperforms diffusion-based baseline in melody, dynamics, and tempo.
2. Achieves high controllability and adherence to global text cues.
3. Demonstrates effectiveness through automatic and human evaluations.

Abstract

In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly used music-specific evaluation metrics and a human study to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human…
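
The three mechanisms named in the abstract (Replacement, Reweighting, Refinement) can be illustrated with a minimal NumPy sketch in the style of Prompt-to-Prompt attention control. This is not the authors' implementation: the function names, the `(num_queries, num_prompt_tokens)` array layout, and the `alignment` list are all assumptions made for illustration, and how these maps are hooked into MUSICGEN's attention layers is not covered here.

```python
import numpy as np

# Assumed layout (hypothetical): each attention map has shape
# (num_queries, num_prompt_tokens), one row per generated audio step.

def replace_attention(src_attn, tgt_attn):
    """Replacement: reuse the source prompt's attention map in the edited pass."""
    assert src_attn.shape == tgt_attn.shape, "prompts must have equal token counts"
    return src_attn.copy()

def reweight_attention(attn, token_index, scale):
    """Reweighting: scale the attention paid to one prompt token."""
    out = attn.copy()
    out[:, token_index] *= scale
    return out

def refine_attention(src_attn, tgt_attn, alignment):
    """Refinement: tokens shared with the source prompt reuse source attention,
    newly added tokens keep the attention computed in the edited pass.

    alignment[j] is the source-token index matching target token j,
    or -1 when target token j does not appear in the source prompt.
    """
    out = tgt_attn.copy()
    for j, i in enumerate(alignment):
        if i >= 0:
            out[:, j] = src_attn[:, i]
    return out

if __name__ == "__main__":
    src = np.array([[0.6, 0.4], [0.1, 0.9]])             # source prompt: 2 tokens
    tgt = np.array([[0.2, 0.3, 0.5], [0.3, 0.3, 0.4]])   # edited prompt: 3 tokens
    # Target token 1 is new; tokens 0 and 2 map to source tokens 0 and 1.
    print(refine_attention(src, tgt, alignment=[0, -1, 1]))
    print(reweight_attention(src, token_index=1, scale=2.0))
```

In this sketch the reweighted map is not renormalized, which follows the Prompt-to-Prompt convention for images; whether EditGen renormalizes is not stated in the text above.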


Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis