STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Suvajit Patra; Soumitra Samanta

arXiv:2603.16163 · cs.CV · March 18, 2026

TL;DR

This paper introduces a unified spatio-temporal attention network for keypoint-based continuous sign language recognition (CSLR). Its encoder has approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance.

Contribution

The paper proposes a spatio-temporal attention mechanism that computes attention across keypoints and within local temporal windows, using far fewer parameters than existing models at comparable recognition performance.

Findings

1. Achieves accuracy comparable to state-of-the-art models on the Phoenix-14T dataset.

2. Reduces encoder parameters by approximately 70-80% relative to existing state-of-the-art models.

3. Provides a more parameter-efficient approach to keypoint-based CSLR.

Abstract

Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to…
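
To make the idea of attending across keypoints and within local temporal windows concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `local_spatiotemporal_attention`, the single-head scaled dot-product form, the joint softmax over all keypoints in the window, and the projection matrices `wq`, `wk`, `wv` are all assumptions made for illustration. The abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_spatiotemporal_attention(x, wq, wk, wv, window=3):
    """Hypothetical single-head attention over keypoints in a local time window.

    x      : (T, K, D) array, T frames, K keypoints, D features per keypoint
    wq, wk : (D, d) query/key projections (assumed, not from the paper)
    wv     : (D, d_out) value projection (assumed, not from the paper)
    window : odd number of frames in the local temporal window
    Returns a (T, K, d_out) array of context-aware keypoint features.
    """
    T, K, _ = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    half = window // 2
    out = np.empty((T, K, v.shape[-1]))
    for t in range(T):
        # Clip the window at the sequence boundaries.
        lo, hi = max(0, t - half), min(T, t + half + 1)
        # Flatten every keypoint of every frame in the window into one key set.
        keys = k[lo:hi].reshape(-1, d)
        vals = v[lo:hi].reshape(-1, v.shape[-1])
        # Each keypoint at frame t scores all keypoints in the window.
        scores = q[t] @ keys.T / np.sqrt(d)
        # Aggregate values with the attention weights.
        out[t] = softmax(scores, axis=-1) @ vals
    return out


# Usage with random data: 5 frames, 4 keypoints, 8 input features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4, 8))
wq, wk, wv = (rng.normal(size=(8, 6)) for _ in range(3))
features = local_spatiotemporal_attention(x, wq, wk, wv, window=3)
```

In this sketch a single set of projections serves both the spatial and the temporal axis, which is one plausible way a unified design could avoid separate spatial and temporal modules. Whether the paper's parameter savings come from this choice is not stated in the abstract.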

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Human Pose and Action Recognition