Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production
Maoxiao Ye, Xinfeng Ye, Mano Manoharan

TL;DR
This paper introduces a hybrid autoregressive-diffusion model for real-time sign language production, combining strengths of both approaches with novel modules to improve accuracy and efficiency.
Contribution
It proposes a new hybrid model integrating autoregressive and diffusion techniques, along with a Multi-Scale Pose Representation and Confidence-Aware Causal Attention for enhanced real-time sign language synthesis.
Findings
Demonstrates improved generation quality on PHOENIX14T and How2Sign datasets.
Achieves real-time performance suitable for practical applications.
Enhances robustness and accuracy with confidence-guided pose generation.
Abstract
Earlier Sign Language Production (SLP) models typically relied on autoregressive methods that generate output tokens one by one, which inherently provide temporal alignment. Although techniques like Teacher Forcing can prevent model collapse during training, they still cannot solve the problem of error accumulation during inference, since ground truth is unavailable at that stage. In contrast, more recent approaches based on diffusion models leverage step-by-step denoising to enable high-quality generation. However, the iterative nature of these models and the requirement to denoise entire sequences limit their applicability in real-time tasks like SLP. To address these limitations, we explore a hybrid approach that combines autoregressive and diffusion models for SLP, leveraging the strengths of both models in sequential dependency modeling and output refinement. To capture fine-grained body…
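The error-accumulation problem mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's model: `true_next`, `model_next`, and `rollout` are hypothetical names, and the "pose" is a single number that advances by 1.0 per frame while the imperfect predictor advances it by 1.1. With teacher forcing, each prediction starts from the ground-truth previous frame, so the error stays at the one-step bias. In free-running inference the model consumes its own outputs, so the same bias compounds from frame to frame.

```python
def true_next(x):
    """Ground-truth dynamics (assumed for illustration): move 1.0 per frame."""
    return x + 1.0


def model_next(x):
    """Hypothetical imperfect predictor with a constant one-step bias of 0.1."""
    return x + 1.1


def rollout(x0, steps, teacher_forcing):
    """Return the absolute prediction error at each of `steps` frames."""
    # Build the ground-truth trajectory first.
    truth = [x0]
    for _ in range(steps):
        truth.append(true_next(truth[-1]))

    errors = []
    prev = x0
    for t in range(steps):
        pred = model_next(prev)
        errors.append(abs(pred - truth[t + 1]))
        # Teacher forcing feeds the ground truth back in;
        # free-running inference feeds back the model's own prediction.
        prev = truth[t + 1] if teacher_forcing else pred
    return errors


tf_errors = rollout(0.0, 10, teacher_forcing=True)
fr_errors = rollout(0.0, 10, teacher_forcing=False)

print("teacher forcing:", [round(e, 2) for e in tf_errors])
print("free running:   ", [round(e, 2) for e in fr_errors])
```

In this toy setting the teacher-forced error is flat while the free-running error grows linearly with the number of generated frames, which is the gap between training and inference that the abstract describes.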
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Hearing Impairment and Communication
