SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

Ivan Vallés-Pérez; Grzegorz Beringer; Piotr Bilinski; Gary Cook; Roberto Barra-Chicote

arXiv:2307.12445 · cs.SD · February 1, 2024 · 1 citation

TL;DR

This paper introduces SCRAPS, a CLIP-inspired model for learning shared representations of phonetic and acoustic spaces in speech, demonstrating robustness to noise and usefulness for downstream speech tasks.

Contribution

It applies CLIP-like contrastive learning to speech, creating shared phonetic-acoustic embeddings with improved robustness and downstream applicability.
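A CLIP-like objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, where the matching phonetic and acoustic embeddings sit on the diagonal. The block below is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `clip_style_loss`, the embedding shapes, and the temperature of 0.07 are illustrative choices that do not come from this page.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(phonetic_emb, acoustic_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    phonetic_emb, acoustic_emb: arrays of shape (batch, dim); row i of each
    is assumed to describe the same utterance. The temperature value is an
    assumption made for this sketch.
    """
    # L2-normalise so the dot product is a cosine similarity.
    p = phonetic_emb / np.linalg.norm(phonetic_emb, axis=1, keepdims=True)
    a = acoustic_emb / np.linalg.norm(acoustic_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; correct pairs lie on the diagonal.
    logits = p @ a.T / temperature
    idx = np.arange(len(p))

    # Cross-entropy in both directions: phonetic -> acoustic and back.
    loss_p2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_a2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2a + loss_a2p)
```

With this formulation, a batch whose pairs are well aligned yields a loss near zero, while unrelated pairs yield a loss on the order of the log of the batch size or higher.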

Findings

01

91% score drop when replacing 20% of phonemes

02

10% performance drop with 75% Gaussian noise

03

Embeddings useful for intelligibility and speech generation
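The two perturbations behind the first two findings can be illustrated with a short sketch. The details here are assumptions, since this page does not give them: `replace_phonemes` swaps a fixed fraction of positions for a different symbol drawn from a hypothetical inventory, and `mix_with_gaussian_noise` interprets "75% Gaussian noise" as a convex mix of the signal with noise scaled to the signal's standard deviation. The paper's exact procedure may differ.

```python
import numpy as np


def replace_phonemes(phonemes, fraction, inventory, rng):
    """Replace a fraction of phonemes with different ones chosen at random.

    `inventory` is a hypothetical list of available phoneme symbols.
    """
    phonemes = list(phonemes)
    n_replace = round(fraction * len(phonemes))
    # Pick distinct positions so exactly n_replace phonemes change.
    positions = rng.choice(len(phonemes), size=n_replace, replace=False)
    for pos in positions:
        alternatives = [p for p in inventory if p != phonemes[pos]]
        phonemes[pos] = alternatives[rng.integers(len(alternatives))]
    return phonemes


def mix_with_gaussian_noise(audio, noise_ratio, rng):
    """Convex mix of a waveform with Gaussian noise (assumed definition).

    noise_ratio=0.75 keeps 25% of the signal and adds 75% noise.
    """
    noise = rng.normal(0.0, audio.std(), size=audio.shape)
    return (1.0 - noise_ratio) * audio + noise_ratio * noise
```

A similarity score from the model could then be compared before and after each perturbation to measure the kind of score drop reported above.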

Abstract

Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim of learning shared representations of phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when replacing 20% of the phonemes at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when mixing the audio with 75% Gaussian noise. We also provide empirical evidence showing that…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Contrastive Language-Image Pre-training