MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production
Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng

TL;DR
This paper introduces MS2SL, a unified diffusion-based framework that generates continuous sign language sequences from spoken content by leveraging a joint embedding space for text, speech, and signs, improving communication accessibility.
Contribution
The paper presents a novel multimodal diffusion model with embedding-consistency learning for sign language production directly from text or speech.
Findings
Achieves competitive performance on the How2Sign and PHOENIX14T datasets.
Utilizes a joint embedding space to unify text, speech, and sign modalities.
Reduces reliance on sign triplets through embedding-consistency training.
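To make the embedding-consistency idea above concrete, here is a minimal sketch of a loss that pulls the embeddings of the same utterance together across modalities and still works when audio is absent. It is an illustration under stated assumptions, not the paper's objective: the function names `cosine_similarity` and `consistency_loss`, the use of cosine distance, and the equal weighting of modality pairs are all hypothetical choices made here.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_loss(
    sign_emb: Vector,
    text_emb: Vector,
    audio_emb: Optional[Vector] = None,
) -> float:
    """Mean (1 - cosine similarity) over the modality pairs that are present.

    Hypothetical loss for illustration only. The embeddings are assumed to
    already live in one shared space; how they are produced is not shown.
    """
    # The sign/text pair is always available in this sketch.
    pairs = [(sign_emb, text_emb)]
    # Audio is optional: when it is missing, its pairs are simply skipped,
    # so the loss is still defined on text and sign alone.
    if audio_emb is not None:
        pairs.append((sign_emb, audio_emb))
        pairs.append((text_emb, audio_emb))
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Calling `consistency_loss(sign, text)` with no audio argument averages over the single sign/text pair, while passing all three embeddings averages over the three pairs. In a real training setup the embeddings would come from learned encoders and the loss would be differentiated by a framework; this sketch only shows the shape of the computation.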
Abstract
Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directly from entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step. Moreover, by creating a joint embedding space for text, audio, and sign, we bind these modalities and leverage the semantic consistency among them to provide informative feedback for the model training. This embedding-consistency learning strategy minimizes the reliance on sign triplets and ensures continuous model refinement, even with a missing audio modality. Experiments on How2Sign and PHOENIX14T…
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Speech and dialogue systems
Methods: Diffusion
