Token-Based Audio Inpainting via Discrete Diffusion
Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani

TL;DR
This paper presents a novel discrete diffusion approach over tokenized music representations for audio inpainting, enabling stable and coherent restoration of long missing segments in musical recordings.
Contribution
It introduces the first discrete diffusion method for audio inpainting using pre-trained tokenizers, with new training techniques for improved performance on long gaps.
Findings
Outperforms existing methods on MusicNet and MAESTRO datasets
Effective for gaps of 150 ms and longer
Advances in musical audio restoration techniques
Abstract
Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths, for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion…
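The two training ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the mask id `MASK`, the function names `span_absorb` and `smoothness_penalty`, the fixed span length, and the choice to penalize squared first-order differences of a per-frame feature sequence are all assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] state (assumption)


def span_absorb(tokens, mask_ratio, span_len, rng):
    """Replace contiguous spans of tokens with MASK (structured corruption).

    Spans of up to `span_len` tokens are placed at random start positions
    until at least round(mask_ratio * len(tokens)) positions are masked.
    Assumes `tokens` does not already contain MASK and span_len >= 1.
    """
    n = len(tokens)
    target = min(n, round(mask_ratio * n))
    out = list(tokens)
    masked = 0
    while masked < target:
        start = rng.randrange(n)
        for i in range(start, min(n, start + span_len)):
            if out[i] != MASK:
                out[i] = MASK
                masked += 1
    return out


def smoothness_penalty(features):
    """Mean squared first-order temporal difference of a 1-D sequence.

    One possible reading of a "derivative-based" regularizer: large jumps
    between consecutive frames are penalized. Needs at least two frames.
    """
    diffs = [b - a for a, b in zip(features, features[1:])]
    return sum(d * d for d in diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = list(range(100, 140))  # stand-in for codec token ids
    noisy = span_absorb(clean, mask_ratio=0.25, span_len=4, rng=rng)
    print(noisy)
    print(smoothness_penalty([0.0, 1.0, 3.0]))
```

Compared with masking each token independently, span-level corruption produces training inputs that resemble the contiguous gaps seen at inpainting time, which is presumably why the abstract calls it structured corruption.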
Peer Reviews
Decision: ICLR 2026 Poster
This paper presents a token-based diffusion model for audio inpainting (AIDD) that operates directly in the discrete token space rather than waveform or spectrogram domains. The idea is original in its formulation and addresses a practical limitation of prior work—difficulty maintaining long-range temporal and semantic consistency when filling large gaps. The proposed span-based masking and derivative regularization are intuitive yet effective design choices that align well with the inpainting o
(1) Codec choice not sufficiently justified. - The method relies entirely on WavTokenizer, but there are other single-codebook codecs such as UniCodec [1] that could equally serve this purpose. The paper does not explain why WavTokenizer was chosen or whether the improvements are specific to that tokenizer. A small ablation with an alternative codec would help isolate the contribution of the proposed diffusion mechanism. (2) No human evaluation. - The paper claims that AIDD produces perceptuall
1. The paper is the first to apply discrete diffusion to tokenized representations for audio inpainting. 2. The method achieves state-of-the-art results on the long-gap audio inpainting task. 3. The code will be open-sourced.
1. The evaluation lacks a subjective listening study, which is essential to validate the perceptual quality and musical plausibility of the results. 2. The paper should quantify information loss from tokenization by reporting metrics on both the original audio (as a reference ceiling) and the reconstructed audio (audio passed through the tokenizer's encoder-decoder). This would clarify the tokenizer's impact and establish the method's practical upper bound. 3. The audio sampling rates are not rep
Well-designed system capable of handling inpainting gaps up to approximately 500ms.
The quality depends heavily on the tokenizer or codec used. The method lacks evaluation outside the music domain and does not consider additional conditions for music restoration. No subjective measurements are provided, and demo samples show noticeable boundary artifacts.
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Digital Media Forensic Detection
