Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining
Neena Aloysius, Geetha M, Prema Nedungadi

TL;DR
This paper introduces ConSignformer, a vision-based continuous sign language recognition (CSLR) model that adapts the Conformer architecture with unsupervised pretraining and Cross-Modal Relative Attention, achieving state-of-the-art results on the PHOENIX datasets.
Contribution
It is the first to adapt Conformer for vision-based CSLR, integrating unsupervised pretraining and a new attention mechanism for improved recognition.
Findings
Achieves state-of-the-art performance on PHOENIX datasets.
Demonstrates the effectiveness of unsupervised pretraining.
Shows that Cross-Modal Relative Attention enhances recognition accuracy.
Abstract
Conventional deep learning frameworks for continuous sign language recognition (CSLR) comprise a single- or multi-modal feature extractor, a sequence-learning module, and a decoder that outputs the glosses. The sequence-learning module is a crucial part, and transformers have demonstrated their efficacy in sequence-to-sequence tasks. Research in Natural Language Processing and Speech Recognition has seen a rapid introduction of transformer variants. In sign language, however, experimentation with the sequence-learning component has been limited. In this work, the state-of-the-art Conformer model for Speech Recognition is adapted for CSLR, and the proposed model is termed ConSignformer. This marks the first instance of employing Conformer for a vision-based task. ConSignformer has a bimodal pipeline of CNN as feature…
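The three-stage structure the abstract describes (feature extractor, sequence-learning module, decoder) can be sketched roughly as below. This is a toy illustration, not ConSignformer: every function name is hypothetical, the feature extractor and sequence module are trivial stand-ins for a CNN and a Conformer, and the collapse-style decoding with a blank label is an assumption that the abstract does not confirm.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # toy stand-in for one flattened video frame
Feature = List[float]     # per-frame feature vector

BLANK = "<blank>"         # hypothetical "no gloss" label (assumption, not from the paper)


def extract_features(frames: Sequence[Frame]) -> List[Feature]:
    """Stand-in for the feature extractor: mean and max of each frame."""
    return [[sum(f) / len(f), max(f)] for f in frames]


def sequence_model(features: List[Feature], window: int = 1) -> List[Feature]:
    """Stand-in for the sequence-learning module: average each feature
    over a small window of neighbouring frames to mix temporal context."""
    out: List[Feature] = []
    for t in range(len(features)):
        lo, hi = max(0, t - window), min(len(features), t + window + 1)
        ctx = features[lo:hi]
        out.append([sum(col) / len(ctx) for col in zip(*ctx)])
    return out


def decode(features: List[Feature], classify: Callable[[Feature], str]) -> List[str]:
    """Stand-in for the decoder: label each frame, merge consecutive
    repeats, and drop blanks to obtain a gloss sequence."""
    glosses: List[str] = []
    prev = None
    for feat in features:
        label = classify(feat)
        if label != prev and label != BLANK:
            glosses.append(label)
        prev = label
    return glosses


def recognize(frames: Sequence[Frame], classify: Callable[[Feature], str]) -> List[str]:
    """Chain the three stages: features -> temporal context -> glosses."""
    return decode(sequence_model(extract_features(frames)), classify)
```

In this sketch only `sequence_model` would change when swapping in a different sequence learner, which is the component the abstract says has seen limited experimentation in sign language work.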
Taxonomy
Topics: Hand Gesture Recognition Systems · Gait Recognition and Analysis · Human Pose and Action Recognition
