SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces
Ivan Vallés-Pérez, Grzegorz Beringer, Piotr Bilinski, Gary Cook, Roberto Barra-Chicote

TL;DR
This paper introduces SCRAPS, a CLIP-inspired model for learning shared representations of phonetic and acoustic spaces in speech, demonstrating robustness to noise and usefulness for downstream speech tasks.
Contribution
It applies CLIP-like contrastive learning to speech, creating shared phonetic-acoustic embeddings with improved robustness and downstream applicability.
Findings
91% score drop when replacing 20% of phonemes
10% performance drop with 75% Gaussian noise
Embeddings useful for intelligibility and speech generation
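The two perturbations behind these findings can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the substitution rule (a replacement always differs from the original phoneme), and the reading of "75% Gaussian noise" as a convex mix of signal and unit-variance noise are all assumptions.

```python
import random


def replace_phonemes(phonemes, vocabulary, fraction=0.2, seed=0):
    """Return a copy of `phonemes` with `fraction` of positions replaced at random.

    Hypothetical helper: the paper's exact corruption procedure is not
    given in this summary.
    """
    rng = random.Random(seed)
    n_replace = round(len(phonemes) * fraction)
    positions = rng.sample(range(len(phonemes)), n_replace)
    corrupted = list(phonemes)
    for pos in positions:
        # Assumption: the substitute must differ from the original phoneme.
        choices = [p for p in vocabulary if p != phonemes[pos]]
        corrupted[pos] = rng.choice(choices)
    return corrupted


def mix_with_gaussian_noise(samples, noise_fraction=0.75, seed=0):
    """Blend audio samples with Gaussian noise.

    Assumption: the noise percentage is a convex mixing weight, so 0.75
    means 25% signal plus 75% unit-variance noise.
    """
    rng = random.Random(seed)
    return [
        (1.0 - noise_fraction) * s + noise_fraction * rng.gauss(0.0, 1.0)
        for s in samples
    ]


if __name__ == "__main__":
    vocab = ["AH", "B", "K", "S", "T", "IY"]
    original = ["K", "AH", "T", "S", "IY", "B", "AH", "T", "S", "K"]
    print(replace_phonemes(original, vocab))
    print(mix_with_gaussian_noise([0.1, -0.2, 0.3]))
```

Comparing the model's matching score on the original and perturbed inputs would then give the relative drops reported above.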
Abstract
Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim of learning shared representations of phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when replacing 20% of the phonemes at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when mixing the audio with 75% Gaussian noise. We also provide empirical evidence showing that…
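The CLIP-style objective referred to in the abstract can be sketched as a symmetric contrastive loss over a batch of paired phonetic and acoustic embeddings. This is a generic illustration of the CLIP formulation, not the authors' implementation: the function names are hypothetical, the encoders are omitted, and the temperature of 0.07 is an assumed value borrowed from common CLIP setups.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_loss(phonetic_emb, acoustic_emb, temperature=0.07):
    """Symmetric cross-entropy over the phonetic-acoustic similarity matrix.

    Row i of each input is assumed to come from the same utterance, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    p = [l2_normalize(v) for v in phonetic_emb]
    a = [l2_normalize(v) for v in acoustic_emb]
    n = len(p)
    logits = [
        [sum(x * y for x, y in zip(p_i, a_j)) / temperature for a_j in a]
        for p_i in p
    ]
    # Phonetic -> acoustic direction: each row should pick its own column.
    loss_p2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Acoustic -> phonetic direction: each column should pick its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_a2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2a + loss_a2p)


if __name__ == "__main__":
    phonetic = [[1.0, 0.0], [0.0, 1.0]]
    acoustic_matched = [[0.9, 0.1], [0.2, 0.8]]
    acoustic_swapped = [[0.2, 0.8], [0.9, 0.1]]
    print(clip_style_loss(phonetic, acoustic_matched))
    print(clip_style_loss(phonetic, acoustic_swapped))
```

The loss is low when each phonetic embedding is closest to the acoustic embedding of the same utterance and high when the pairing is shuffled, which is the property a shared phonetic-acoustic space relies on.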
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Contrastive Language-Image Pre-training
