USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

Ahmed Abul Hasanaath; Hamzah Luqman

arXiv:2512.13415·cs.CV·December 30, 2025

Open Access

TL;DR

The paper introduces USTM, a novel spatio-temporal encoder using a Swin Transformer backbone with lightweight temporal adaptation, achieving state-of-the-art continuous sign language recognition from RGB videos.

Contribution

The paper presents the USTM framework, which models complex spatial and temporal patterns for CSLR without auxiliary modalities and outperforms existing methods.

Findings

1. USTM achieves state-of-the-art results on benchmark datasets.
2. The framework captures fine-grained spatial and long-range temporal features.
3. USTM performs competitively against multi-stream approaches.

Abstract

Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities.…
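The abstract names a lightweight temporal adapter with positional embeddings but, as excerpted here, does not describe its internals. The sketch below is therefore not the authors' TAPE module. It only illustrates the general idea of a residual bottleneck adapter that adds temporal position information to per-frame features and mixes them along the time axis. All names (`sinusoidal_positions`, `temporal_adapter`, `w_down`, `w_up`, `kernel`) and design choices (sinusoidal embeddings, ReLU bottleneck, a shared 3-tap temporal filter) are assumptions made for illustration.

```python
import numpy as np


def sinusoidal_positions(num_frames: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal position table of shape (num_frames, dim); dim must be even."""
    assert dim % 2 == 0, "this sketch assumes an even feature dimension"
    positions = np.arange(num_frames)[:, None]                      # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    table = np.zeros((num_frames, dim))
    table[:, 0::2] = np.sin(positions * freqs)  # even channels
    table[:, 1::2] = np.cos(positions * freqs)  # odd channels
    return table


def temporal_adapter(frames, w_down, w_up, kernel):
    """Hypothetical residual adapter over per-frame features.

    frames: (T, D) features, e.g. one vector per video frame from a spatial backbone.
    w_down: (D, r) down-projection, w_up: (r, D) up-projection, with r << D.
    kernel: odd-length 1D filter applied along time, shared across channels.
    """
    num_frames, dim = frames.shape
    x = frames + sinusoidal_positions(num_frames, dim)  # inject frame order
    hidden = np.maximum(x @ w_down, 0.0)                # bottleneck + ReLU, (T, r)

    # Temporal mixing: cross-correlation along time with zero padding,
    # so the output keeps the same number of frames.
    pad = len(kernel) // 2
    padded = np.pad(hidden, ((pad, pad), (0, 0)))
    mixed = sum(k * padded[i:i + num_frames] for i, k in enumerate(kernel))

    return frames + mixed @ w_up                        # residual connection


rng = np.random.default_rng(0)
frames = rng.standard_normal((16, 8))        # 16 frames, 8-dim features
w_down = rng.standard_normal((8, 2)) * 0.1   # bottleneck width r = 2
w_up = np.zeros((2, 8))                      # zero init: adapter starts as identity
out = temporal_adapter(frames, w_down, w_up, kernel=[0.25, 0.5, 0.25])
```

Initializing the up-projection to zero is a common adapter convention: the module initially passes the backbone features through unchanged and only learns a temporal correction during training. Whether USTM does the same is not stated in the abstract.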

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Interactive and Immersive Displays