Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition
Vinicius Ribeiro, Yves Laprie

TL;DR
This paper proposes evaluating speech articulation synthesis by using phoneme recognition with articulatory features as a proxy, aiming to better capture phonetic nuances than traditional metrics.
Contribution
It introduces a novel evaluation method leveraging phoneme recognition on articulatory features to assess synthesis quality more effectively.
Findings
The articulatory feature set is phonetically rich.
Phoneme recognition performance varies with different synthetic articulatory features.
The method captures nuances in phoneme production better than traditional metrics.
Abstract
Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity, and articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a…
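To make the proxy idea concrete, the sketch below scores a recognizer's output against the intended phoneme sequence using a phoneme error rate (edit distance normalized by reference length). This is an assumption for illustration only: the truncated abstract does not state which recognition metric the authors report, and the function names `edit_distance` and `phoneme_error_rate` are hypothetical, not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a spurious phoneme
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: the synthesized articulation of /b a t/ is
# recognized as /p a t/, i.e. one substitution out of three phonemes.
target = ["b", "a", "t"]
recognized = ["p", "a", "t"]
per = phoneme_error_rate(target, recognized)
```

Under this assumed scoring, a synthesis model whose output is recognized with fewer phoneme errors would rank higher, regardless of how small its point-wise distance to the reference vocal tract shapes is.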
