AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition
Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen, Assane Gueye

TL;DR
AutoSign introduces a novel autoregressive transformer model that directly translates pose sequences into text for continuous sign language recognition, overcoming traditional pipeline limitations and improving accuracy.
Contribution
It proposes a direct pose-to-text translation model using a decoder-only transformer, bypassing alignment-based methods and enhancing signer-independent CSLR performance.
Findings
Achieves up to a 6.1% improvement in word error rate (WER) on the Isharah-1000 dataset.
Demonstrates the effectiveness of hand and body gestures for signer-independent recognition.
Eliminates multi-stage pipelines, reducing error propagation.
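Because the headline finding is reported as WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance between reference and hypothesis, divided by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's own evaluation script is not shown in this summary and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization by whitespace is an assumption.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words (initially: empty reference prefix).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. Lower values are better, so an "improvement" in WER means a reduction.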
Abstract
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This requires recognizing and interpreting the signer's hand, face, and body gestures, which is challenging because all of these cues must be combined. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages and overfitting, and they struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text,…
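To make the phrase "autoregressive" concrete, the following is a minimal sketch of a generic decoding loop in which text tokens are produced one at a time, each conditioned on the pose sequence and on the tokens emitted so far. It is not the authors' implementation: greedy selection, the function names `greedy_decode` and `toy_scores`, the `<bos>`/`<eos>` markers, and the scoring interface are all assumptions, and the toy scorer merely stands in for a trained model.

```python
def greedy_decode(next_token_scores, pose_frames, bos="<bos>", eos="<eos>", max_len=20):
    """Generate tokens one at a time, always picking the highest-scoring one.

    next_token_scores(pose_frames, tokens) is a hypothetical callable that
    returns a dict mapping each candidate token to a score.
    """
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(pose_frames, tokens)
        best = max(scores, key=scores.get)
        if best == eos:          # stop once the model predicts end-of-sequence
            break
        tokens.append(best)      # feed the prediction back in as context
    return tokens[1:]            # drop the start marker


def toy_scores(pose_frames, tokens):
    """Hypothetical stand-in for a trained model: emits a fixed sentence."""
    target = ["hello", "world", "<eos>"]
    step = len(tokens) - 1                       # number of tokens emitted so far
    wanted = target[min(step, len(target) - 1)]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in target}


# Placeholder pose input: two frames of 2-D keypoint coordinates.
decoded = greedy_decode(toy_scores, pose_frames=[[0.0, 0.0], [0.1, 0.2]])
```

The point of the sketch is the control flow: no explicit frame-to-gloss alignment step appears anywhere, in contrast to the CTC or HMM stage of the pipelines described above.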
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Face Recognition and Analysis
