Is Self-Supervised Learning Enough to Fill in the Gap? A Study on Speech Inpainting

Ihab Asaad; Maxime Jacquelin; Olivier Perrotin; Laurent Girin; Thomas Hueber

arXiv:2405.20101·cs.SD·December 9, 2025

TL;DR

This paper examines whether self-supervised learning (SSL) speech encoders can perform speech inpainting with no training beyond their original pretext task, compares this with fine-tuning either the encoder or the decoder, and evaluates the reconstruction of missing speech segments under single- and multi-speaker conditions.

Contribution

The study shows that SSL-trained speech encoders can be used for speech inpainting with no encoder training beyond the pretext task, which highlights the transferability of SSL pretext tasks to this application.

Findings

01. SSL-based methods outperform baselines in speech reconstruction

02. Fine-tuning the encoder improves single-speaker inpainting accuracy

03. Pre-trained encoder is more effective for multi-speaker scenarios

Abstract

Speech inpainting consists of reconstructing corrupted or missing speech segments using the surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, simply by adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task (here, inpainting). In practice, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder's output and (2) fine-tuning the encoder for an inpainting task based on a frozen decoder's input. Evaluations are conducted under single- and multi-speaker conditions using in-domain datasets and…
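
The two configurations and the masking step they share can be illustrated with a minimal sketch. It is not the authors' implementation: the names `InpaintingSetup` and `mask_span` are hypothetical, frames are plain Python floats rather than HuBERT features, and replacing the gap with a constant `mask_value` is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InpaintingSetup:
    """Which module is updated during fine-tuning (hypothetical helper)."""
    name: str
    train_encoder: bool
    train_decoder: bool


# The two configurations named in the abstract.
DECODER_FINE_TUNED = InpaintingSetup(
    "frozen encoder, fine-tuned decoder", train_encoder=False, train_decoder=True
)
ENCODER_FINE_TUNED = InpaintingSetup(
    "fine-tuned encoder, frozen decoder", train_encoder=True, train_decoder=False
)


def mask_span(frames, start, length, mask_value=0.0):
    """Replace frames[start:start + length] with mask_value.

    Returns the masked copy and a boolean list marking the gap.
    Masking with a constant is an illustrative assumption.
    """
    if start < 0 or length < 0 or start + length > len(frames):
        raise ValueError("masked span must lie inside the sequence")
    is_masked = [start <= i < start + length for i in range(len(frames))]
    masked = [mask_value if m else x for x, m in zip(frames, is_masked)]
    return masked, is_masked


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0]
    masked, is_masked = mask_span(frames, start=1, length=2)
    print(masked)     # [1.0, 0.0, 0.0, 4.0, 5.0]
    print(is_masked)  # [False, True, True, False, False]
```

In either configuration, the model would receive the masked sequence and be asked to reconstruct the frames where `is_masked` is true. The setups differ only in which module's parameters are updated.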

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Inpainting