Two-Stream Network for Sign Language Recognition and Translation
Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, Brian Mak

TL;DR
This paper introduces a two-stream network for sign language recognition and translation that models raw RGB videos and keypoint sequences in separate, interacting streams, and reports state-of-the-art results.
Contribution
The paper proposes TwoStream, a network with a dual visual encoder that models raw videos and keypoint sequences in separate streams. The streams interact through bidirectional lateral connections, a sign pyramid network with auxiliary supervision, and frame-level self-distillation.
Findings
Achieves state-of-the-art results on multiple datasets
Effectively models both visual and keypoint information
Demonstrates the benefit of dual-stream interaction mechanisms
Abstract
Sign languages are visual languages using manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connection, sign pyramid network with auxiliary supervision, and frame-level self-distillation.…
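The interaction mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the scalar mixing weight `alpha`, the use of weighted addition for the lateral connection, and the choice of KL divergence against a "teacher" distribution for self-distillation are all assumptions made for illustration. Features are plain lists of per-frame vectors.

```python
import math

def lateral_connect(video_feats, keypoint_feats, alpha=1.0):
    """Hypothetical bidirectional lateral connection.

    Each stream receives the other stream's per-frame features, scaled by
    `alpha` (an assumed parameter), added element-wise to its own features.
    Both inputs are lists of frames, each frame a list of floats.
    """
    assert len(video_feats) == len(keypoint_feats), "streams must be frame-aligned"
    new_video, new_keypoint = [], []
    for v_frame, k_frame in zip(video_feats, keypoint_feats):
        assert len(v_frame) == len(k_frame), "feature sizes must match"
        new_video.append([v + alpha * k for v, k in zip(v_frame, k_frame)])
        new_keypoint.append([k + alpha * v for v, k in zip(v_frame, k_frame)])
    return new_video, new_keypoint

def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_self_distillation_loss(teacher_logits, student_logits):
    """Assumed form of frame-level self-distillation.

    Returns the mean over frames of KL(teacher || student), where each
    distribution is the softmax of that frame's logits.
    """
    assert len(teacher_logits) == len(student_logits)
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame)
        q = softmax(s_frame)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)

# Toy usage: one frame, two feature channels per stream.
video, keypoint = lateral_connect([[1.0, 2.0]], [[10.0, 20.0]], alpha=0.5)
loss = frame_self_distillation_loss([[2.0, 0.0]], [[0.0, 2.0]])
```

In this sketch the two streams stay separate after the exchange, so each can keep its own prediction head; the distillation loss then pulls a single stream's frame-wise predictions toward a shared target.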
Taxonomy
Topics: Hand Gesture Recognition Systems · Gait Recognition and Analysis · Human Pose and Action Recognition
Methods: Surrogate Lagrangian Relaxation
