Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
Pramit Saha, Sidney Fels

TL;DR
This paper introduces a novel invertible neural network model that learns a joint latent space for articulatory and acoustic speech representations, enabling bidirectional mapping and preserving domain-specific features.
Contribution
It presents a semi-supervised convolutional autoencoder with normalizing flows for joint articulatory-acoustic modeling, a novel approach in speech representation learning.
Findings
Effective bidirectional articulatory-acoustic mapping
Preservation of domain-specific features
Successful joint encoding of articulatory and acoustic data
Abstract
The articulatory geometric configurations of the vocal tract and the acoustic properties of the resultant speech sound are considered to have a strong causal relationship. This paper aims at finding a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features. Our model utilizes a convolutional autoencoder architecture and normalizing flow-based models to allow both forward and inverse mappings, in a semi-supervised manner, between the mid-sagittal vocal tract geometry of a two degrees-of-freedom articulatory synthesizer with a 1D acoustic wave model and the Mel-spectrogram representation of the synthesized speech sounds. Our approach achieves satisfactory performance in both articulatory-to-acoustic as well as acoustic-to-articulatory…
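The abstract relies on normalizing flows being exactly invertible, which is what makes both the forward and the inverse mapping available from one model. The sketch below illustrates that property with a single affine coupling layer, a common flow building block. It is an assumption made for illustration only: the abstract does not say which flow layers the authors use, and the class name `AffineCoupling`, the layer sizes, and the random untrained weights are all hypothetical.

```python
import math
import random


class AffineCoupling:
    """One affine coupling layer (hypothetical illustration, not the paper's model)."""

    def __init__(self, dim, hidden=8, seed=0):
        rng = random.Random(seed)
        self.d = dim // 2            # size of the half that passes through unchanged
        self.out = dim - self.d      # size of the half that gets transformed
        # Random, untrained weights of a tiny two-layer conditioner network.
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(self.d)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(2 * self.out)] for _ in range(hidden)]

    def _scale_shift(self, x1):
        # The conditioner sees only the untouched half, so it can be
        # recomputed identically during inversion.
        hidden = len(self.w2)
        h = [math.tanh(sum(x1[i] * self.w1[i][j] for i in range(self.d)))
             for j in range(hidden)]
        o = [sum(h[j] * self.w2[j][k] for j in range(hidden))
             for k in range(2 * self.out)]
        s = [math.tanh(v) for v in o[:self.out]]  # bounded log-scale
        t = o[self.out:]                          # shift
        return s, t

    def forward(self, x):
        x1, x2 = x[:self.d], x[self.d:]
        s, t = self._scale_shift(x1)
        z2 = [x2[k] * math.exp(s[k]) + t[k] for k in range(self.out)]
        log_det = sum(s)  # log |det Jacobian| of the transform
        return x1 + z2, log_det

    def inverse(self, z):
        z1, z2 = z[:self.d], z[self.d:]
        s, t = self._scale_shift(z1)
        x2 = [(z2[k] - t[k]) * math.exp(-s[k]) for k in range(self.out)]
        return z1 + x2


layer = AffineCoupling(dim=6)
x = [0.3, -1.2, 0.8, 0.5, -0.4, 1.1]   # stand-in for a latent code
z, log_det = layer.forward(x)          # forward mapping
x_rec = layer.inverse(z)               # inverse mapping recovers the input
```

Because the scale and shift depend only on the half of the vector that is passed through unchanged, the inverse needs no iterative solve. In a setup like the one the abstract describes, layers of this kind would presumably act on autoencoder latent codes rather than on raw geometry or spectrogram data, but that detail is not confirmed by the text here.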
