TL;DR
This paper introduces a unified neural model that synthesizes speech and co-speech gesture together, matching the perceived quality of traditional pipeline approaches while synthesizing faster and using fewer parameters.
Contribution
The authors define integrated speech and gesture synthesis (ISG) as a new problem and propose a set of models adapted from state-of-the-art neural speech-synthesis engines, demonstrating quality comparable to pipeline systems with faster synthesis and fewer parameters.
Findings
Participants rated the integrated model as comparable to state-of-the-art pipeline systems.
The integrated model synthesized output faster than the pipeline systems.
The integrated model used fewer parameters than the pipeline systems.
Abstract
Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models as they will be used in real-world applications: speech and gesture presented together. The results show that…
