SignDiff: Diffusion Model for American Sign Language Production

Sen Fang; Chunyu Sui; Yanghao Zhou; Xuedong Zhang; Hongbin Zhong; Yapeng Tian; Chen Chen

arXiv:2308.16082·cs.CV·May 1, 2025·6 cites

TL;DR

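SignDiff is a diffusion-based approach to American Sign Language production: it generates ASL skeletal pose videos from text and renders signers from skeleton poses, using a Frame Reinforcement Network (FR-Net), two improved modules, and a new loss function. It reports state-of-the-art BLEU-4 on How2Sign and outperforms previous methods on PHOENIX14T.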

Contribution

The paper introduces SignDiff, a dual-condition diffusion model with a new Frame Reinforcement Network (FR-Net), and proposes the first baseline for ASL production, combining two improved modules and a new loss function for high-quality sign language synthesis.

Findings

01 Achieved SOTA BLEU-4 scores on the How2Sign dataset

02 Outperformed previous methods on the PHOENIX14T dataset

03 Image quality exceeds previous results by 10 percentage points in SSIM

Abstract

In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose. SignDiff has a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of multiple fingers in the diffusion model. In addition, we propose a new method for American Sign Language Production (ASLP), which can generate ASL skeletal pose videos from text input, integrating two new improved modules and a new loss function to improve the accuracy and quality of sign language skeletal posture and enhance the ability of the model to train on large-scale data. We propose the first baseline for ASL production and report scores of 17.19 and 12.85 on BLEU-4 on the How2Sign dev/test…

Code & Models

Datasets

FangSen9000/SignDiff
dataset · 35 downloads

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition

Methods: Diffusion