Spatio-temporal transformer to support automatic sign language translation
Christian Ruiz, Fabio Martinez

TL;DR
This paper presents a spatio-temporal transformer architecture for sign language translation that effectively captures gesture variations and long sequences, outperforming baselines on multiple datasets.
Contribution
It introduces a novel transformer-based model combining convolutional and attention mechanisms for improved sign language translation.
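As a rough illustration of how convolution and attention complement each other, the sketch below applies a temporal convolution (local context) followed by single-head self-attention (long-range context) to a sequence of per-frame feature vectors. It is not the authors' architecture: the function names, the fixed smoothing kernel, and the identity query/key/value projections are assumptions made purely for illustration.

```python
import math

def temporal_conv(frames, kernel):
    """1D convolution along time with zero padding (hypothetical helper).

    Each output frame is a weighted sum of its temporal neighbours,
    which captures local motion context.
    """
    half = len(kernel) // 2
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < len(frames):
                for d in range(dim):
                    acc[d] += w * frames[s][d]
        out.append(acc)
    return out

def self_attention(x):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections); a trained model would learn these projections.
    """
    dim = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        norm = sum(exps)
        weights = [e / norm for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(dim)])
    return out

def encode(frames, kernel=(0.25, 0.5, 0.25)):
    """Local mixing first, then global mixing across the whole sequence."""
    return self_attention(temporal_conv(frames, kernel))

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = encode(frames)
```

Every frame in `encoded` is a convex combination of the convolved frames, so any time step can draw on any other regardless of how far apart they are in the sequence.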
Findings
Achieved BLEU4 of 46.84% on CoL-SLTD
Achieved BLEU4 of 30.77% on PHOENIX14T
Demonstrated robustness in real-world scenarios
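For readers unfamiliar with the metric, BLEU4 is the geometric mean of clipped 1-gram to 4-gram precisions between a candidate translation and a reference, multiplied by a brevity penalty. The sketch below is a minimal sentence-level version with a single reference and no smoothing. Published scores such as those above are normally computed at corpus level with standard tooling, so this function (whose name and signature are illustrative) will not reproduce them.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Sentence-level BLEU-4, single reference, no smoothing (illustrative)."""
    log_precisions = []
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(sum(log_precisions) / 4)

reference = "the cat is on the mat".split()
candidate = "the cat is on a mat".split()
score = bleu4(candidate, reference)
```

A reported BLEU4 of 46.84% corresponds to a score of 0.4684 on this 0-to-1 scale.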
Abstract
Sign Language Translation (SLT) systems support communication for hearing-impaired people by finding equivalences between signed and spoken languages. This task is challenging, however, because of the many sign variations, the complexity of the language, and the inherent richness of expression. Computational approaches have shown the capability to support SLT. Nonetheless, these approaches remain limited in covering gesture variability and in supporting long-sequence translation. This paper introduces a Transformer-based architecture that encodes spatio-temporal motion gestures, preserving both local and long-range spatial information through the use of multiple convolutional and attention mechanisms. The proposed approach was validated on the Colombian Sign Language Translation Dataset (CoL-SLTD), outperforming baseline approaches and achieving a BLEU4 of 46.84%. Additionally, the proposed approach was validated…
Taxonomy
Topics: Hand Gesture Recognition Systems · Robotics and Automated Systems
Methods: Softmax · Attention Is All You Need
