Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis
Ingmar Steiner (INRIA Lorraine - LORIA, Trinity College Dublin), Korin Richmond (CSTR), Slim Ouni (INRIA Lorraine - LORIA)

TL;DR
This paper explores the use of multimodal speech production data to enhance the animation of intraoral articulators, aiming to improve audiovisual speech synthesis quality by integrating detailed articulatory modeling.
Contribution
It introduces a data-driven approach to animate intraoral articulators using multimodal speech production data, advancing beyond simple rule-based methods.
Findings
Multimodal data improves articulatory animation quality.
Enhanced intraoral articulator modeling leads to more realistic AV speech.
Data-driven methods outperform traditional viseme morphing techniques.
Abstract
The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing
