SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation
Xu Wang, Shengeng Tang, Lechao Cheng, Feng Li, Shuo Wang, Richang Hong

TL;DR
SignAligner is a novel multi-stage framework that enhances sign language generation by integrating text semantics, correcting multimodal pose representations online, and synthesizing realistic sign videos, thereby improving accuracy and expressiveness.
Contribution
The paper introduces SignAligner, a new method that combines text-driven pose co-generation, online multimodal correction, and video synthesis for more natural sign language generation.
Findings
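SignAligner outperforms existing methods on PHOENIX14T and CSL-daily in both semantic accuracy (e.g., 20.56 BLEU-1 and 8.17 BLEU-4 on the PHOENIX14T test set) and visual quality (e.g., 0.731 SSIM, 26.257 FID).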
It produces more expressive and coherent sign language videos.
The approach effectively integrates multimodal pose information.
Abstract
Sign language generation aims to produce diverse sign representations based on spoken language. However, achieving realistic and naturalistic generation remains a significant challenge due to the complexity of sign language, which encompasses intricate hand gestures, facial expressions, and body movements. In this work, we introduce PHOENIX14T+, an extended version of the widely-used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven pose modalities co-generation, online collaborative correction of multimodality, and realistic sign video synthesis. First, by incorporating text semantics, we design a joint sign language generator to simultaneously produce posture coordinates, gesture actions, and body movements.…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. SignAligner significantly outperforms existing state-of-the-art approaches on both the PHOENIX14T and CSL-daily datasets. On the PHOENIX14T test set, it achieves superior scores in semantic accuracy (e.g., 20.56 BLEU-1, 8.17 BLEU-4) and visual quality (e.g., 0.731 SSIM, 26.257 FID). 2. The paper constructs a dataset with three modalities, whose quality is validated by a robust user study involving 100 volunteers, which found SignAligner's videos to be markedly better in naturalness, temporal
1. Paper details need clarification. For example, the sentences from line 168 to 173 are hard to understand. Variables such as n should be in math form in LaTeX. In line 266, the verb should be "constrain". 2. The proposed method lacks novelty. The dataset is just constructed by leveraging existing techniques to extract Pose, Hamer, and Smplerx for two sign language datasets. The proposed method leverages the extracted three modalities with simple feature reconstruction and cross-attention-based f
The motivation is clear: single-modality or multi-stage pipelines lead to semantic and spatiotemporal consistency issues, while joint modeling with online correction can mitigate them. The framework is well structured; combining three-modality joint generation with OCC is a reasonable technical path. Experiments cover two common datasets, report both semantic and visual metrics, and include ablations with stable and sizable gains. The dataset expansion scheme may provide reusable supervision for
(1) Lack of quantified error propagation and robustness: all three representations introduce errors during acquisition and generation. The paper does not provide systematic noise injection tests or small-scale human-calibrated comparisons, so it is unclear how errors are amplified through the pipeline or which representation is most sensitive. (2) Limited datasets and benchmarks: results are mainly on PHOENIX14T and CSL-daily; larger datasets with native keypoint/hand annotations such as How2Sig
1. The authors extend the PHOENIX-14T and CSL-daily datasets by extracting and providing DWPose poses and HaMeR, SMPLer-X meshes, which can contribute to future SLP work. 2. Each modality alone is imperfect, hence their combination helps in achieving better results. 3. The paper proposes a new alignment strategy between modalities, where they use a different modality for each of the queries, keys, and values.
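The alignment idea noted in point 3 can be illustrated with a minimal NumPy sketch of single-head cross-attention in which queries, keys, and values each come from a different modality. This is not the authors' implementation: the assignment of Pose to queries, Hamer to keys, and Smplerx to values, the function name `tri_modal_cross_attention`, the random projection matrices, and all dimensions are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tri_modal_cross_attention(pose, hamer, smplerx, w_q, w_k, w_v):
    """Single-head attention with Q, K, V drawn from three modalities.

    Assumption (not from the paper): Pose -> queries, Hamer -> keys,
    Smplerx -> values. Each input has shape (T, d_modality).
    """
    q = pose @ w_q       # (T, d)
    k = hamer @ w_k      # (T, d)
    v = smplerx @ w_v    # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) scaled dot products
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 4 frames, per-modality feature widths 6, 8, 10.
rng = np.random.default_rng(0)
T, d_pose, d_hamer, d_smplerx, d = 4, 6, 8, 10, 5
pose = rng.normal(size=(T, d_pose))
hamer = rng.normal(size=(T, d_hamer))
smplerx = rng.normal(size=(T, d_smplerx))
w_q = rng.normal(size=(d_pose, d))
w_k = rng.normal(size=(d_hamer, d))
w_v = rng.normal(size=(d_smplerx, d))

out, weights = tri_modal_cross_attention(pose, hamer, smplerx, w_q, w_k, w_v)
print(out.shape)  # (4, 5)
```

In this sketch each output frame is a weighted mix of value-modality features, with the weights determined by how well the query modality matches the key modality at each time step. Swapping which modality plays which role only requires reordering the arguments and projection matrices.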
1. Novelty is limited. Most of the components were proposed in prior work, and the only new component is the collaborative correction, a cross-attention with different Q/K/V, which is neither explained, motivated, nor validated as better than other approaches. 2. The paper has many typos and problematic citations, which make it hard to follow. See 1. below for examples. 3. Many irrelevant details and not enough relevant details, see 2. below. 4. Extraction quality discussion is unclear, see 4. b
1. Novel approach: The approach of harmonizing multiple pose modalities (Pose, Hamer, and Smplerx) for sign language generation is innovative and addresses key challenges in producing coherent and natural sign language videos. 2. Valuable dataset extension: Enriches two benchmarks, PHOENIX14T and CSL-daily, with high-fidelity modalities including Pose, Hamer, and Smplerx, filling gaps in existing SLG data, which previously included only videos and basic skeletons. 3. Comprehensive experiments: This
1. No hyperparameter sensitivity analysis: Key parameters (OCC’s α/β/γ, Transformer hidden size/attention heads) lack impact analysis, harming reproducibility. 2. Insufficient framework ablation: Fails to isolate contributions of single stages (e.g., co-gen + synthesis without OCC) to confirm three-stage necessity. 3. Related work: While the related work section provides a solid overview of previous methods, it is recommended to conduct a more detailed comparison between the contributions o
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Tactile and Sensory Interactions
