STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition
Suvajit Patra, Soumitra Samanta

TL;DR
This paper introduces a unified spatio-temporal attention network for continuous sign language recognition that reduces model complexity by 70-80% while maintaining high accuracy, improving efficiency in keypoint-based modeling.
Contribution
The paper proposes a novel spatio-temporal attention mechanism that significantly reduces parameters compared to existing models, with comparable recognition performance.
Findings
Achieves accuracy comparable to state-of-the-art models on the Phoenix-14T dataset.
Reduces model parameters by approximately 70-80%.
Provides a more efficient approach for keypoint-based CSLR.
Abstract
Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to…
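The idea of attending across keypoints and within a local temporal window at the same time can be illustrated with a small sketch. This is not the authors' architecture: it assumes identity query/key/value projections (no learned weights), a single head, and a symmetric window of `window` frames on each side. The names `local_st_attention` and `window` are hypothetical, chosen only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_st_attention(x, window=1):
    """Joint spatio-temporal attention over a local window (illustrative only).

    x[t][k] is a d-dimensional feature (list of floats) for keypoint k at
    frame t. Each (frame, keypoint) query attends to every keypoint in every
    frame within `window` frames of t. Projections are identity: an
    assumption made to keep the sketch parameter-free.
    """
    T, K, d = len(x), len(x[0]), len(x[0][0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Context tokens: all keypoints in all frames of the local window.
        context = [x[tt][kk] for tt in range(lo, hi) for kk in range(K)]
        frame_out = []
        for k in range(K):
            q = x[t][k]
            # Scaled dot-product scores against every context token.
            scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
            weights = softmax(scores)
            # Weighted sum of context features gives the new representation.
            frame_out.append(
                [sum(w * c[i] for w, c in zip(weights, context)) for i in range(d)]
            )
        out.append(frame_out)
    return out
```

Because a single attention step mixes spatial and temporal context, a design of this kind needs no separate temporal convolution stack after the spatial module. The output keeps the input shape (frames × keypoints × feature dimension).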
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition
