Learning to Compute the Articulatory Representations of Speech with the MIRRORNET

Yashish M. Siriwardena; Carol Espy-Wilson; Shihab Shamma

arXiv:2210.16454 · eess.AS · May 26, 2023

Open Access

TL;DR

This paper introduces MirrorNet, an autoencoder inspired by sensorimotor learning that learns to control an articulatory speech synthesizer. After initialization with about 30 minutes of ground-truth articulatory data, it is trained without supervision and learns articulatory representations with accuracy comparable to fully supervised systems.

Contribution

The work presents an autoencoder architecture that learns articulatory representations of speech from limited ground-truth data by combining a brief supervised initialization with unsupervised training.

Findings

01

The articulatory synthesizer that MirrorNet controls can synthesize continuous speech for unseen speakers.

02

After initialization with about 30 minutes of articulatory data, MirrorNet learns meaningful articulatory representations through unsupervised training.

03

MirrorNet achieves accuracy comparable to fully supervised systems.

Abstract

Most organisms, including humans, function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. An autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm is explored in this work to control an articulatory synthesizer, with minimal exposure to ground-truth articulatory data. The articulatory synthesizer takes as input a set of six vocal tract variables (TVs) and source features (voicing indicators and pitch) and is able to synthesize continuous speech for unseen speakers. We show that the MirrorNet, once initialized (with ~30 mins of articulatory data) and further trained in an unsupervised fashion ('learning phase'), can learn meaningful articulatory representations with comparable…
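
The two training stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoder, the linear stand-in for the articulatory synthesizer, the frame width `N_BINS`, the number of source features `N_SOURCE`, and both mean-squared-error losses are assumptions made only for illustration. Only the six TVs and the split into a supervised initialization and an unsupervised learning phase come from the abstract.

```python
import numpy as np

N_TVS = 6      # six vocal tract variables, as stated in the abstract
N_SOURCE = 3   # assumption: two voicing indicators plus pitch
N_BINS = 128   # assumption: width of one spectrogram frame
N_LATENT = N_TVS + N_SOURCE

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(N_BINS, N_LATENT))  # hypothetical encoder weights
W_dec = rng.normal(scale=0.1, size=(N_LATENT, N_BINS))  # hypothetical synthesizer stand-in


def encode(spec):
    """Map spectrogram frames (T, N_BINS) to articulatory parameters (T, N_LATENT)."""
    return np.tanh(spec @ W_enc)


def synthesize(params):
    """Toy stand-in for the articulatory synthesizer: parameters back to frames."""
    return params @ W_dec


def init_loss(spec, true_params):
    """Supervised initialization: compare the latent code with ground-truth parameters."""
    return float(np.mean((encode(spec) - true_params) ** 2))


def learning_loss(spec):
    """Unsupervised learning phase: compare the resynthesized frames with the input."""
    return float(np.mean((synthesize(encode(spec)) - spec) ** 2))


# Random placeholder data, not real speech or articulatory recordings.
spec = rng.normal(size=(50, N_BINS))
true_params = rng.uniform(-1, 1, size=(50, N_LATENT))

latent = encode(spec)
tvs, source = latent[:, :N_TVS], latent[:, N_TVS:]  # split the latent code
```

In this sketch, `init_loss` needs ground-truth articulatory parameters while `learning_loss` needs only audio, which mirrors the distinction the abstract draws between the initialization and the learning phase.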

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies