Learning to Compute the Articulatory Representations of Speech with the MIRRORNET
Yashish M. Siriwardena, Carol Espy-Wilson, Shihab Shamma

TL;DR
This paper introduces MirrorNet, an autoencoder-based model inspired by sensorimotor learning that learns to control an articulatory speech synthesizer. With minimal exposure to ground-truth articulatory data, it learns articulatory representations with accuracy comparable to fully supervised systems.
Contribution
The work presents an autoencoder architecture that learns articulatory speech representations from limited ground-truth data by combining a short supervised initialization with further unsupervised training.
Findings
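The articulatory synthesizer that MirrorNet controls can synthesize continuous speech for unseen speakers.
MirrorNet learns meaningful articulatory representations after initialization with about 30 minutes of articulatory data.
It achieves accuracy comparable to fully supervised systems.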
Abstract
Most organisms, including humans, function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. An autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm is explored in this work to control an articulatory synthesizer, with minimal exposure to ground-truth articulatory data. The articulatory synthesizer takes as input a set of six vocal Tract Variables (TVs) and source features (voicing indicators and pitch) and is able to synthesize continuous speech for unseen speakers. We show that the MirrorNet, once initialized (with ~30 mins of articulatory data) and further trained in an unsupervised fashion ('learning phase'), can learn meaningful articulatory representations with comparable…
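The two training stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a fixed linear map stands in for the articulatory synthesizer, a linear matrix stands in for the neural encoder, and the names `W_syn`, `synthesize`, and `recon_loss`, as well as the sizes `N_SRC` and `N_AUDIO`, are assumptions made only for illustration. The sketch shows an encoder first fitted on a small paired set (initialization) and then refined using audio alone by minimizing reconstruction error through the synthesizer (the 'learning phase').

```python
import numpy as np

rng = np.random.default_rng(0)

N_TV = 6        # six vocal Tract Variables, as stated in the abstract
N_SRC = 3       # assumed number of source features (voicing indicators, pitch)
N_AUDIO = 20    # assumed size of one audio feature frame
N_CTRL = N_TV + N_SRC

# Toy stand-in for the articulatory synthesizer: a fixed linear map.
W_syn = rng.normal(size=(N_AUDIO, N_CTRL))

def synthesize(ctrl):
    """Map control parameters (N_CTRL, frames) to audio features."""
    return W_syn @ ctrl

def recon_loss(E, audio):
    """Mean squared error between audio and its re-synthesis."""
    return float(np.mean(np.sum((synthesize(E @ audio) - audio) ** 2, axis=0)))

# Initialization: a small paired set with noisy articulatory measurements.
ctrl_true = rng.normal(size=(N_CTRL, 30))
audio_paired = synthesize(ctrl_true)
ctrl_measured = ctrl_true + 0.3 * rng.normal(size=ctrl_true.shape)
# Least-squares encoder E such that E @ audio_paired ~ ctrl_measured.
E = np.linalg.lstsq(audio_paired.T, ctrl_measured.T, rcond=1e-8)[0].T

# Learning phase: audio only, no articulatory targets.
audio = synthesize(rng.normal(size=(N_CTRL, 500)))
n = audio.shape[1]
# Step size 1/L, where L bounds the curvature of the quadratic loss.
lipschitz = (2.0 / n) * np.linalg.norm(W_syn, 2) ** 2 * np.linalg.norm(audio, 2) ** 2
lr = 1.0 / lipschitz

loss_before = recon_loss(E, audio)
for _ in range(200):
    residual = synthesize(E @ audio) - audio
    # Gradient of the reconstruction loss with respect to the encoder.
    E -= lr * (2.0 / n) * (W_syn.T @ residual @ audio.T)
loss_after = recon_loss(E, audio)

print(f"reconstruction loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In this toy setting the synthesizer is known and differentiable, so the encoder can be updated directly. A real articulatory synthesizer is nonlinear, and the encoder would be a neural network rather than a single matrix.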
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
