Is Self-Supervised Learning Enough to Fill in the Gap? A Study on Speech Inpainting
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber

TL;DR
This paper asks whether self-supervised learning (SSL) speech encoders can perform speech inpainting with no training beyond their pretext task. It pairs a HuBERT encoder with a HiFi-GAN decoder, compares two fine-tuning strategies (decoder only versus encoder only), and evaluates reconstruction of missing speech segments under single- and multi-speaker conditions.
Contribution
The study shows that SSL-trained speech encoders can be effectively used for speech inpainting without extra training, highlighting the transferability of SSL pretext tasks to this application.
Findings
SSL-based methods outperform baselines in speech reconstruction
Fine-tuning the encoder improves single-speaker inpainting accuracy
The frozen pre-trained encoder is more effective in multi-speaker scenarios
Abstract
Speech inpainting consists in reconstructing corrupted or missing speech segments using surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, and simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task -- here, inpainting. Practically, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder's output and (2) fine-tuning the encoder for an inpainting task based on a frozen decoder's input. Evaluations are conducted under single- and multi-speaker conditions using in-domain datasets and…
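As an illustration of the setup described in the abstract, here is a minimal sketch of how a missing segment might be simulated and how the two training configurations differ in which module is updated. This is not the authors' code. The 16 kHz sample rate, the 320-sample frame stride, the zero-filling of the gap, and every name (`Gap`, `make_gap`, `mask_waveform`, `masked_frames`, `trainable_modules`) are assumptions made for the example.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16_000   # Hz; assumed input rate for the encoder
FRAME_STRIDE = 320     # samples per encoder frame (20 ms at 16 kHz); assumed


@dataclass
class Gap:
    start: int  # first missing sample (inclusive)
    end: int    # one past the last missing sample (exclusive)


def make_gap(num_samples, gap_ms, center_ratio=0.5):
    """Place a gap of `gap_ms` milliseconds around a relative position."""
    gap_len = int(SAMPLE_RATE * gap_ms / 1000)
    if gap_len <= 0 or gap_len > num_samples:
        raise ValueError("gap must be non-empty and fit inside the signal")
    center = int(num_samples * center_ratio)
    # Clamp so the whole gap stays inside the signal.
    start = max(0, min(center - gap_len // 2, num_samples - gap_len))
    return Gap(start, start + gap_len)


def mask_waveform(waveform, gap):
    """Return a copy of the waveform with the gap samples set to zero."""
    return [0.0 if gap.start <= i < gap.end else x
            for i, x in enumerate(waveform)]


def masked_frames(gap):
    """Encoder frame indices overlapping the gap.

    Simplified: uses the stride only and ignores the encoder's
    receptive field, so neighbouring frames may also be affected.
    """
    first = gap.start // FRAME_STRIDE
    last = (gap.end - 1) // FRAME_STRIDE
    return list(range(first, last + 1))


def trainable_modules(configuration):
    """Which module is updated in each configuration from the abstract."""
    if configuration == 1:   # frozen encoder, fine-tuned decoder
        return {"encoder": False, "decoder": True}
    if configuration == 2:   # fine-tuned encoder, frozen decoder
        return {"encoder": True, "decoder": False}
    raise ValueError("configuration must be 1 or 2")
```

For a one-second signal, `make_gap(16_000, 200)` removes samples 6400 to 9599, which under these assumptions corresponds to ten encoder frames (indices 20 to 29) that the model would have to fill in from the surrounding context.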
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Inpainting
