Human Part-wise 3D Motion Context Learning for Sign Language Recognition
Taeryung Lee, Yeonguk Oh, Kyoung Mu Lee

TL;DR
This paper introduces P3D, a framework for sign language recognition that combines part-wise motion context learning with a joint 2D-3D pose ensemble and reports higher accuracy than previous methods.
Contribution
The paper presents a new part-wise motion context learning approach and the first pose ensemble method that uses 2D and 3D poses jointly for sign language recognition.
Findings
P3D outperforms previous state-of-the-art methods on the WLASL dataset.
Part-wise motion context encoding improves recognition accuracy.
Ensembling 2D and 3D poses enhances the model's ability to distinguish signs.
Abstract
In this paper, we propose P3D, the human part-wise motion context learning framework for sign language recognition. Our main contributions lie in two dimensions: learning the part-wise motion context and employing the pose ensemble to utilize 2D and 3D pose jointly. First, our empirical observation implies that part-wise context encoding benefits the performance of sign language recognition. While previous methods of sign language recognition learned motion context from the sequence of the entire pose, we argue that such methods cannot exploit part-specific motion context. In order to utilize part-wise motion context, we propose the alternating combination of a part-wise encoding Transformer (PET) and a whole-body encoding Transformer (WET). PET encodes the motion contexts from a part sequence, while WET merges them into a unified context. By learning part-wise motion context, our P3D…
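The alternation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the part names (`body`, `left_hand`, `right_hand`), the toy feature sizes, and the choice to let the WET stand-in attend across part tokens at each frame are all assumptions made for illustration. The attention here has no learned weights, and the feed-forward layers, residual connections, and normalization of a real Transformer block are omitted.

```python
import math


def self_attention(tokens):
    """Weight-free single-head self-attention over a list of feature vectors."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the tokens.
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def part_wise_step(features):
    """PET stand-in: attend over time, separately for each part.

    features[part][frame] is a feature vector.
    """
    return {part: self_attention(seq) for part, seq in features.items()}


def whole_body_step(features):
    """WET stand-in (assumed form): at each frame, attend across part tokens."""
    parts = list(features)
    num_frames = len(features[parts[0]])
    merged = {part: [] for part in parts}
    for t in range(num_frames):
        mixed = self_attention([features[p][t] for p in parts])
        for p, vec in zip(parts, mixed):
            merged[p].append(vec)
    return merged


def encode(features, num_blocks=2):
    """Alternate the part-wise and whole-body steps num_blocks times."""
    for _ in range(num_blocks):
        features = whole_body_step(part_wise_step(features))
    return features


# Hypothetical toy input: 3 parts, 4 frames, 2-dimensional features.
toy = {
    "body": [[0.1 * t, 0.0] for t in range(4)],
    "left_hand": [[0.0, 0.2 * t] for t in range(4)],
    "right_hand": [[0.3, 0.1 * t] for t in range(4)],
}
encoded = encode(toy)
```

The output keeps the same part/frame/feature layout as the input, so the two steps can be stacked any number of times. A practical version would replace `self_attention` with learned multi-head attention layers.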
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Gait Recognition and Analysis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection
