ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos
Razieh Rastgoo, Kourosh Kiani, Sergio Escalera

TL;DR
This paper introduces ZS-SLR, a zero-shot sign language recognition system using RGB-D videos and vision Transformers, achieving state-of-the-art results by mapping visual features to linguistic embeddings.
Contribution
The paper presents a novel two-stream Transformer-based model for zero-shot sign language recognition from RGB-D videos, integrating human detection, segmentation, and semantic mapping.
Findings
Achieved state-of-the-art results on four benchmark datasets.
Effectively mapped visual features to linguistic embeddings.
Demonstrated robustness across multiple sign language datasets.
Abstract
Sign Language Recognition (SLR) is a challenging research area in computer vision. To tackle the annotation bottleneck in SLR, we formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model from two input modalities: RGB and Depth videos. To benefit from the capabilities of vision Transformers, we use two vision Transformer models for human detection and visual feature representation. We configure a Transformer encoder-decoder architecture as a fast and accurate human detection model to overcome the challenges of current human detection models. Using the human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the human body is obtained using a vision Transformer and an LSTM network. A semantic space maps the visual features to the lingual embedding of the class labels via a Bidirectional…
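The zero-shot step the abstract describes, mapping visual features into a semantic space shared with the class-label embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it names the language model, so the label embeddings here are toy vectors, and the learned mapping is replaced by a plain linear projection. The names `project`, `zero_shot_predict`, `W`, and `labels` are hypothetical and exist only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(features, weights):
    """Map a visual feature vector into the semantic space.

    Assumption: a single linear map stands in for whatever mapping
    the paper actually learns. `weights` is a list of rows, one row
    per semantic dimension.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def zero_shot_predict(features, weights, label_embeddings):
    """Return the label whose embedding is closest to the projected features.

    The candidate labels can be classes never seen during training,
    because only their embeddings are needed at prediction time.
    """
    z = project(features, weights)
    return max(label_embeddings, key=lambda name: cosine(z, label_embeddings[name]))


# Toy setup: 3-d visual features, 2-d semantic space, two candidate labels.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
labels = {"hello": [1.0, 0.0], "thanks": [0.0, 1.0]}

prediction = zero_shot_predict([0.2, 0.9, 0.7], W, labels)
print(prediction)
```

In this toy setup the projected vector is `[0.2, 1.6]`, which lies much closer in angle to the "thanks" embedding than to "hello", so the nearest-label rule picks "thanks". Adding a new class at prediction time only requires adding its embedding to `labels`.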
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Gait Recognition and Analysis
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Vision Transformer · Label Smoothing · Tanh Activation · Surrogate Lagrangian Relaxation
