Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE
Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber

TL;DR
This study explores how articulatory and acoustic features can be combined using VQ-VAE to discover speech units in a self-supervised way, revealing complementary phonetic information and improving representation accuracy.
Contribution
It introduces a novel approach using VQ-VAE to fuse articulatory and acoustic data for self-supervised speech unit discovery, highlighting the benefits of multimodal integration.
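As a rough illustration of the quantization step at the core of a VQ-VAE, the sketch below maps a fused feature vector to its nearest codebook entry. It is a minimal sketch under stated assumptions, not the authors' implementation: the fusion by simple concatenation, the function name `quantize`, the toy feature values, and the three-entry codebook are all hypothetical, and in a real VQ-VAE the vector being quantized would be a learned encoder output rather than raw features.

```python
import math


def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z (Euclidean distance)."""
    distances = [math.dist(z, code) for code in codebook]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, codebook[index]


# Hypothetical toy values standing in for per-frame features.
articulatory = [0.9, 0.1]
acoustic = [0.2, 0.8]

# Assumption: the two modalities are fused by concatenation.
fused = articulatory + acoustic

# Hypothetical codebook; in a VQ-VAE these vectors are learned during training.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]

unit, code = quantize(fused, codebook)
print(unit)  # index of the discrete unit assigned to this frame
```

The returned index plays the role of the discrete speech unit; the training losses that shape the codebook are omitted here.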
Findings
Articulatory information organizes the latent space mainly by place of articulation.
Acoustic information structures the latent space mainly by manner of articulation.
Fusing the two modalities yields more accurate phonetic representations.
Abstract
The human perception system is often assumed to recruit motor knowledge when processing auditory speech inputs. Using articulatory modeling and deep learning, this study examines how this articulatory information can be used for discovering speech units in a self-supervised setting. We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data. In line with the zero-resource paradigm, an ABX test was then used to investigate how the extracted representations encode phonetically relevant properties. Experiments were conducted on three different corpora in English and French. We found that articulatory information tends to organise the latent representations in terms of place of articulation, whereas the speech acoustics mainly structure the latent space in terms of manner of articulation. We show that an optimal…
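The ABX test mentioned in the abstract can be sketched as follows: for each triplet (A, B, X) in which X shares a category with A but not with B, the representation is counted as correct when X lies closer to A than to B. This is a simplified sketch under assumptions, not the evaluation code used in the paper: each item is a single vector (zero-resource evaluations typically compare frame sequences, often with dynamic time warping), ties are counted as errors, and the function name `abx_error_rate` and the toy data are hypothetical.

```python
import math
from itertools import permutations


def abx_error_rate(items, distance=math.dist):
    """Fraction of (A, B, X) triplets in which X is not strictly closer to A than to B.

    items: list of (label, vector) pairs. A and X share a label, B has a
    different one, and A and X are always distinct items.
    """
    errors = 0
    total = 0
    for (label_a, a), (label_b, b), (label_x, x) in permutations(items, 3):
        if label_a == label_x and label_a != label_b:
            total += 1
            if distance(a, x) >= distance(b, x):  # assumption: ties count as errors
                errors += 1
    if total == 0:
        raise ValueError("no valid ABX triplets in items")
    return errors / total


# Hypothetical toy representations for two phone categories.
items = [
    ("p", [0.0, 0.0]),
    ("p", [0.1, 0.0]),
    ("t", [1.0, 1.0]),
    ("t", [0.9, 1.0]),
]
print(abx_error_rate(items))  # 0.0, since the two categories are well separated
```

A lower error rate indicates that the representation separates the contrasted categories more reliably.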
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
Methods: Test
