Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs
Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

TL;DR
This paper introduces a unified framework for speech editing detection and content localization built on Audio LLMs, supported by a new realistic dataset, prior-enhanced prompting, and an acoustic consistency loss.
Contribution
It presents AiEdit, a bilingual dataset of approximately 140 hours, and reformulates speech editing detection (SED) as a structured text generation task with prior-enhanced prompting and an acoustic consistency loss.
Findings
Outperforms existing methods in detection accuracy.
Effectively handles addition, deletion, and modification edits.
Improves joint reasoning over edit type and content localization.
Abstract
Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit, a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state-of-the-art end-to-end speech editing systems, providing a more…
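The abstract's point about deletion-type edits can be made concrete with a small sketch. It is illustrative only and not the authors' method: the `Edit` record, the 20 ms frame hop, and the JSON target format are all assumptions introduced here. The sketch shows that a deletion occupies zero frames in the edited audio, so a per-frame label sequence has nothing to mark, whereas a generated text description can still report it.

```python
import json
from dataclasses import dataclass

FRAME_MS = 20  # assumed frame hop; not taken from the paper


@dataclass
class Edit:
    """Hypothetical edit record, positions measured in the edited audio."""
    kind: str      # "addition", "deletion", or "modification"
    start_ms: int  # where the edited region begins
    end_ms: int    # where it ends; equals start_ms for a deletion (a cut point)


def frame_labels(total_ms, edits):
    """Frame-level supervision: 1 for frames that contain edited audio."""
    labels = [0] * (total_ms // FRAME_MS)
    for e in edits:
        for i in range(e.start_ms // FRAME_MS, e.end_ms // FRAME_MS):
            labels[i] = 1
    return labels


def text_target(edits):
    """Hypothetical structured text target a generative model could emit."""
    return json.dumps(
        [{"type": e.kind, "start_ms": e.start_ms, "end_ms": e.end_ms} for e in edits]
    )


modification = Edit("modification", 40, 80)
deletion = Edit("deletion", 100, 100)  # removed content leaves no frames behind

print(frame_labels(200, [modification]))  # two frames are marked
print(frame_labels(200, [deletion]))      # no frame is marked
print(text_target([deletion]))            # the deletion is still describable
```

Under these assumptions, the frame-level view loses the deletion entirely, while the text target retains both its type and its position.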
