USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition
Ahmed Abul Hasanaath, Hamzah Luqman

TL;DR
The paper introduces USTM, a spatio-temporal encoder that pairs a Swin Transformer backbone with a lightweight temporal adapter and reports state-of-the-art continuous sign language recognition (CSLR) from RGB videos alone.
Contribution
The paper presents USTM, a framework that models complex spatial and temporal patterns for CSLR without auxiliary modalities; the authors report that it outperforms existing methods.
Findings
USTM achieves state-of-the-art results on benchmark datasets.
The framework captures fine-grained spatial and long-range temporal features.
USTM performs competitively against multi-stream approaches.
Abstract
Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that models complex patterns using a Swin Transformer backbone enhanced with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities.…
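To make the idea of a lightweight temporal adapter with positional embeddings more concrete, here is a minimal NumPy sketch. It is not the authors' TAPE module: the abstract does not specify its internals, so the bottleneck projection, the depthwise temporal convolution, the sinusoidal embeddings, the zero-initialised up-projection, and the names `sinusoidal_positions` and `TemporalAdapter` are all assumptions chosen for illustration. The input stands in for per-frame features that a spatial backbone such as a Swin Transformer might produce.

```python
import numpy as np


def sinusoidal_positions(num_frames, dim):
    """Fixed sinusoidal embeddings of shape (num_frames, dim). Assumed form."""
    positions = np.arange(num_frames)[:, None]
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    angles = positions * freqs[None, :]
    emb = np.zeros((num_frames, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles[:, : dim // 2])
    return emb


class TemporalAdapter:
    """Hypothetical residual adapter that mixes information across frames."""

    def __init__(self, dim, bottleneck, kernel_size=3, seed=0):
        rng = np.random.default_rng(seed)
        self.kernel_size = kernel_size  # assumed odd, so padding is symmetric
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        # One temporal filter per bottleneck channel (depthwise convolution).
        self.kernel = rng.normal(0.0, 0.02, (kernel_size, bottleneck))
        # Zero init: the adapter starts out as an identity mapping.
        self.up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        # x: per-frame features with shape (num_frames, dim).
        num_frames, dim = x.shape
        h = (x + sinusoidal_positions(num_frames, dim)) @ self.down
        h = np.maximum(h, 0.0)  # ReLU
        pad = self.kernel_size // 2
        padded = np.pad(h, ((pad, pad), (0, 0)))
        # Each output frame is a weighted sum of its temporal neighbours.
        mixed = sum(
            self.kernel[k] * padded[k : k + num_frames]
            for k in range(self.kernel_size)
        )
        return x + mixed @ self.up  # residual connection


# Stand-in for backbone output: 16 frames, 8-dimensional features.
features = np.random.default_rng(1).normal(size=(16, 8))
adapter = TemporalAdapter(dim=8, bottleneck=4)
adapted = adapter(features)
```

The residual form with a small bottleneck is one common way to keep such an adapter cheap relative to the backbone. With the zero-initialised up-projection, the adapter initially leaves the backbone features unchanged and only learns temporal mixing during training.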
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Interactive and Immersive Displays
