Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model

Sanjana Sankar; Martin Lenglet; Gerard Bailly; Denis Beautemps; Thomas Hueber

arXiv:2501.04799·cs.CL·January 10, 2025

Open Access

TL;DR

This paper introduces a method for generating Cued Speech, a visual communication system used by people with hearing impairment, from text by adapting a pre-trained audiovisual TTS model. An automatic Cued Speech recognizer decodes the generated output with approximately 77% phonetic accuracy.

Contribution

The paper reprograms a pre-trained audiovisual TTS model (AVTacotron2) to produce Cued Speech hand and lip movements from text, a novel application of transfer learning in this domain.

Findings

01 Reached approximately 77% phoneme-level decoding accuracy, as measured by an automatic Cued Speech recognition system.

02 Validated the approach on two publicly available datasets, including one recorded specifically for this study.

03 Demonstrated the effectiveness of transfer learning for Cued Speech generation.

Abstract

This paper presents a novel approach for the automatic generation of Cued Speech (ACSG), a visual communication system used by people with hearing impairment to better elicit the spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2). This model is reprogrammed to infer Cued Speech (CS) hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed using an automatic CS recognition system. With a decoding accuracy at the phonetic level reaching approximately 77%, the results demonstrate the effectiveness of our approach.
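
The abstract reports decoding accuracy at the phonetic level but does not define the metric on this page. A common convention, assumed here and not confirmed by the source, is one minus the phoneme error rate, where errors are counted with the Levenshtein edit distance between the reference and decoded phoneme sequences. The function names and the toy phoneme sequences below are hypothetical and serve only as an illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of an extra decoded phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phonetic_accuracy(ref, hyp):
    """Assumed metric: 1 - (edit distance / reference length)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Toy example (made-up sequences, not data from the paper).
reference = ["b", "o", "n", "j", "u", "r"]
decoded = ["b", "o", "m", "j", "u"]
accuracy = phonetic_accuracy(reference, decoded)
print(f"phonetic accuracy: {accuracy:.3f}")
```

Under this definition, insertions can push the score below zero for very poor hypotheses, which is one reason some evaluations report a correctness rate that ignores insertions instead.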

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Speech Recognition and Synthesis