MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation
Ronglai Zuo, Rolandos Alexandros Potamias, Qi Sun, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou

TL;DR
MaDiS is a masked diffusion language model for sign language generation. It captures bidirectional dependencies, generates multiple tokens in parallel, and draws on multi-level sign representations, improving both generation quality and inference speed over autoregressive baselines.
Contribution
The paper introduces MaDiS, a masked diffusion-based language model with a tri-level pretraining scheme and a new unmasking strategy, advancing sign language generation beyond autoregressive models.
Findings
Achieves superior performance on CSL-Daily, Phoenix-2014T, and How2Sign datasets.
Demonstrates 40% higher throughput compared to previous methods.
Outperforms existing models on DTW error, SiBLEU, and SiCLIP metrics.
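For readers unfamiliar with the DTW error mentioned above, the following is a minimal sketch of textbook dynamic time warping between two pose sequences. It is not the paper's evaluation code: the joint set, normalization, and distance function used by the authors are not given here, so the Euclidean frame distance and the function name `dtw_distance` are assumptions made for illustration.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Classic DTW alignment cost between two sequences of equal-dimension vectors.

    Assumption: frames are compared with Euclidean distance and the
    total cost is left unnormalized.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: skip a frame in either sequence, or match both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW allows frames to be repeated during alignment, a generated motion that is merely slower or faster than the reference is not penalized for the timing difference alone.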
Abstract
Sign language generation (SLG) aims to translate written texts into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives to leverage complementary, multi-level sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, which restructures…
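To make the contrast with token-by-token decoding concrete, here is a minimal sketch of generic confidence-based parallel unmasking, as commonly used with masked generative models. It is not the paper's unmasking strategy with temporal checkpoints, whose details are not given here. The names `parallel_unmask`, `toy_predict`, and `MASK`, and the even reveal schedule, are all hypothetical; `toy_predict` is a stand-in for a trained denoiser.

```python
import math

MASK = -1  # hypothetical mask token id

def parallel_unmask(predict, length, steps):
    """Fill a fully masked sequence in at most `steps` rounds.

    Each round, `predict` sees the whole partially filled sequence
    (bidirectional context) and returns one (token_id, confidence)
    pair per position; the most confident masked positions are committed.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Assumed schedule: spread the remaining positions evenly over the remaining rounds
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens

def toy_predict(tokens):
    # Stand-in for a trained model: proposes token id = position index,
    # with confidence decreasing along the sequence
    n = len(tokens)
    return [(i, 1.0 - i / n) for i in range(n)]
```

With `parallel_unmask(toy_predict, 8, 3)`, all eight positions are filled using three model calls, whereas an autoregressive decoder would need one call per token.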
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Interactive and Immersive Displays
