UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
Jinting Wang, Shan Yang, Chenxing Li, Dong Yu, Li Liu

TL;DR
UniCUE is a unified framework that converts Chinese Cued Speech videos directly into speech, integrating recognition with generation to improve accuracy and reduce error propagation relative to pipeline methods that chain recognition with text-to-speech.
Contribution
Introduces the first unified model for Cued Speech Video-to-Speech generation (CSV2S) that combines understanding and generation, along with a large-scale Mandarin CS dataset.
Findings
Achieves state-of-the-art performance on the UniCUE-HI dataset.
Effectively integrates recognition and speech generation tasks.
Reduces error propagation compared to pipeline approaches.
Abstract
Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly…
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
Methods: Adapter
