Token-Based Audio Inpainting via Discrete Diffusion

Tali Dror; Iftach Shoham; Moshe Buchris; Oren Gal; Haim Permuter; Gilad Katz; Eliya Nachmani

arXiv:2507.08333·cs.SD·February 18, 2026

TL;DR

This paper presents a novel discrete diffusion approach over tokenized music representations for audio inpainting, enabling stable and coherent restoration of long missing segments in musical recordings.

Contribution

It introduces the first discrete diffusion method for audio inpainting using pre-trained tokenizers, with new training techniques for improved performance on long gaps.

Findings

01

Outperforms existing methods on MusicNet and MAESTRO datasets

02

Effective for gaps of 150 ms and longer

03

Advances in musical audio restoration techniques

Abstract

Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion…
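The two training ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the function names `span_absorb` and `derivative_penalty`, the `MASK` id, the exponential span-length distribution, and the choice of a first-order finite difference are all assumptions made here for illustration.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] state (assumption)


def span_absorb(tokens, mask_ratio, mean_span, rng):
    """Absorb contiguous spans of `tokens` into MASK.

    Assumed corruption scheme: spans with exponentially distributed
    lengths (mean `mean_span`) are masked until exactly
    round(mask_ratio * len(tokens)) positions are in the MASK state.
    """
    n = len(tokens)
    out = list(tokens)
    target = round(mask_ratio * n)
    masked = 0
    while masked < target:
        # Draw a span length, clipped to [1, n].
        span = max(1, min(n, round(rng.expovariate(1.0 / mean_span))))
        start = rng.randrange(0, n - span + 1)
        for i in range(start, start + span):
            if masked == target:
                break
            if out[i] != MASK:
                out[i] = MASK
                masked += 1
    return out


def derivative_penalty(frames):
    """Mean squared first-order finite difference along the time axis.

    `frames` is a list of equal-length vectors, one per time step
    (for example predicted embeddings). Smaller values mean smoother
    temporal dynamics. The exact loss in the paper may differ.
    """
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += (b - a) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 140))  # stand-in token ids
    noisy = span_absorb(tokens, mask_ratio=0.5, mean_span=4, rng=rng)
    print(noisy)
    print(derivative_penalty([[0.0], [1.0], [2.0]]))
```

Masking contiguous spans rather than independent positions makes the training corruption resemble the contiguous gaps seen at inpainting time, which is presumably the sense in which the abstract calls the corruption "structured".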

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

This paper presents a token-based diffusion model for audio inpainting (AIDD) that operates directly in the discrete token space rather than waveform or spectrogram domains. The idea is original in its formulation and addresses a practical limitation of prior work—difficulty maintaining long-range temporal and semantic consistency when filling large gaps. The proposed span-based masking and derivative regularization are intuitive yet effective design choices that align well with the inpainting o

Weaknesses

(1) Codec choice not sufficiently justified. The method relies entirely on WavTokenizer, but there are other single-codebook codecs such as UniCodec [1] that could equally serve this purpose. The paper does not explain why WavTokenizer was chosen or whether the improvements are specific to that tokenizer. A small ablation with an alternative codec would help isolate the contribution of the proposed diffusion mechanism.
(2) No human evaluation. The paper claims that AIDD produces perceptuall

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is the first to apply discrete diffusion to tokenized representations for audio inpainting.
2. The method achieves state-of-the-art results on the long-gap audio inpainting task.
3. The code will be open-sourced.

Weaknesses

1. The evaluation lacks a subjective listening study, which is essential to validate the perceptual quality and musical plausibility of the results.
2. The paper should quantify information loss from tokenization by reporting metrics on both the original audio (as a reference ceiling) and the reconstructed audio (audio passed through the tokenizer's encoder-decoder). This would clarify the tokenizer's impact and establish the method's practical upper bound.
3. The audio sampling rates are not rep

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Well-designed system capable of handling inpainting gaps up to approximately 500 ms.

Weaknesses

The quality depends heavily on the tokenizer or codec used. The method lacks evaluation outside the music domain and does not consider additional conditions for music restoration. No subjective measurements are provided, and demo samples show noticeable boundary artifacts.

Code & Models

Models

🤗 TaliDror/AIDD (model)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Digital Media Forensic Detection