MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production

Jian Ma; Wenguan Wang; Yi Yang; Feng Zheng

arXiv:2407.12842 · cs.CL · July 19, 2024

TL;DR

This paper introduces MS2SL, a unified diffusion-based framework that generates continuous sign language sequences from spoken content by leveraging a joint embedding space for text, speech, and signs, improving communication accessibility.

Contribution

The paper presents a novel multimodal diffusion model with embedding-consistency learning for sign language production directly from text or speech.

Findings

01. Achieves competitive performance on How2Sign and PHOENIX14T datasets.

02. Utilizes a joint embedding space to unify text, speech, and sign modalities.

03. Reduces reliance on sign triplets through embedding-consistency training.

Abstract

Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directly from entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step. Moreover, by creating a joint embedding space for text, audio, and sign, we bind these modalities and leverage the semantic consistency among them to provide informative feedback for the model training. This embedding-consistency learning strategy minimizes the reliance on sign triplets and ensures continuous model refinement, even with a missing audio modality. Experiments on How2Sign and PHOENIX14T…
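
The embedding-consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' loss: it assumes a simple cosine-distance objective over a shared embedding space, and the names `cosine_similarity` and `consistency_loss` are hypothetical. It only shows how a training signal can still be computed when the audio embedding is missing.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency_loss(
    text_emb: Vector,
    sign_emb: Vector,
    audio_emb: Optional[Vector] = None,
) -> float:
    """Mean cosine distance over every available modality pair.

    Assumption (not from the paper): all embeddings already live in one
    joint space, and consistency is measured as 1 - cosine similarity.
    """
    # The text-sign pair is always available.
    pairs = [(text_emb, sign_emb)]
    # Audio pairs are added only when the audio modality is present.
    if audio_emb is not None:
        pairs.append((audio_emb, sign_emb))
        pairs.append((text_emb, audio_emb))
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Audio missing: the loss falls back to the text-sign pair alone.
loss_without_audio = consistency_loss([1.0, 0.0], [0.0, 1.0])
# Audio present: all three pairs contribute.
loss_with_audio = consistency_loss([1.0, 0.0], [1.0, 0.0], audio_emb=[0.0, 1.0])
```

Averaging over only the pairs that exist keeps the loss on the same scale whether or not audio is supplied, which is one simple way to tolerate incomplete text, audio, and sign triplets.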

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Speech and dialogue systems

Methods: Diffusion