Correcting Mispronunciations in Speech using Spectrogram Inpainting
Talia Ben-Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet

TL;DR
This paper introduces a deep learning inpainting system using spectrogram masking to correct mispronunciations in speech, maintaining speaker identity and providing synthetic feedback for language learners and children with pronunciation issues.
Contribution
It presents a novel spectrogram inpainting approach, built on a U-net architecture and trained on examples of proper speech, that generates corrected speech while preserving the speaker's voice.
Findings
Listeners slightly prefer the generated speech over simple phoneme replacement.
The system effectively reconstructs correct pronunciations in minimal pairs.
It shows promise for aiding language learning and speech therapy.
Abstract
Learning a new language involves constantly comparing speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Grownup learners of a language tweak their speech to match the tutor reference. This paper proposes a method to synthetically generate correct pronunciation feedback given incorrect production. Furthermore, our aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. This waveform serves as an input to a speech generator, implemented as a deep learning inpainting system with a U-net architecture, and trained to output a reconstructed speech. The training set is composed of…
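The masking step described in the abstract can be illustrated with a minimal sketch. It assumes the start and end times of the inaccurate phoneme are already known; how the system locates them is not stated in the text above. The function name `mask_segment`, the 16 kHz sample rate, and the toy waveform are hypothetical choices for illustration, not the authors' implementation.

```python
def mask_segment(samples, sample_rate, start_s, end_s):
    """Return a copy of `samples` with the span [start_s, end_s) set to zero.

    `samples` is a sequence of floats; `start_s` and `end_s` are the assumed
    phoneme boundaries in seconds (hypothetical inputs for this sketch).
    """
    if not 0 <= start_s <= end_s:
        raise ValueError("expected 0 <= start_s <= end_s")
    # Convert times to sample indices, clamped to the length of the signal.
    start = min(int(round(start_s * sample_rate)), len(samples))
    end = min(int(round(end_s * sample_rate)), len(samples))
    masked = list(samples)  # copy so the original recording is left untouched
    masked[start:end] = [0.0] * (end - start)
    return masked


# Toy example: one second of a constant signal at an assumed 16 kHz rate,
# with a hypothetical mispronounced phoneme between 0.25 s and 0.40 s.
sample_rate = 16000
waveform = [0.5] * sample_rate
masked = mask_segment(waveform, sample_rate, 0.25, 0.40)
```

In the pipeline the abstract describes, a signal masked in this way is the input to the speech generator, which is trained to fill in the zeroed span.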
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
Methods: Convolution · Concatenated Skip Connection · Max Pooling · U-Net · Inpainting
