SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

Kunyuan Xie; Zhixi Cai; Kalin Stefanov

arXiv:2605.02094·cs.CV·May 5, 2026

TL;DR

SignMAE introduces a segmentation-driven self-supervised learning approach that enhances fine-grained sign language recognition by focusing on key body parts, achieving state-of-the-art results with fewer input frames.

Contribution

The paper presents a novel segmentation-based masking pretraining method tailored for sign language recognition, improving fine-grained cue capture over generic pretraining.

Findings

01. Achieves state-of-the-art accuracy on the WLASL, NMFs-CSL, and Slovo datasets.

02. Improves per-instance and per-class Top-1 accuracy.

03. Uses fewer input frames and modalities than comparable methods.

Abstract

Subtle hand differences make sign language recognition challenging, yet many existing methods rely on encoders pretrained on generic action datasets that poorly capture such fine-grained cues. We propose a self-supervised pretraining method for sign language recognition that uses segmentation-based masking to adapt to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained sign representation learning. On WLASL, NMFs-CSL, and Slovo, our encoder achieves state-of-the-art performance, improving per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
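
To make the idea of segmentation-based masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary per-frame segmentation of key body parts is already available and biases the choice of masked patches toward regions that segmentation covers. The function name `segmentation_guided_mask`, the `bias` parameter, the patch size, and the mask ratio are all hypothetical choices made for illustration. The sketch models only the presence of body parts, not their motion.

```python
import numpy as np


def segmentation_guided_mask(seg, patch=16, mask_ratio=0.75, bias=4.0, rng=None):
    """Pick patches to mask, favoring patches covered by key body parts.

    seg: (T, H, W) binary array, 1 where a key body part is present
         (assumed to come from some external segmentation model).
    Returns a bool array of shape (T, H // patch, W // patch); True = masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W = seg.shape
    gh, gw = H // patch, W // patch

    # Fraction of each patch covered by the segmentation, in [0, 1].
    cropped = seg[:, : gh * patch, : gw * patch]
    coverage = cropped.reshape(T, gh, patch, gw, patch).mean(axis=(2, 4))

    # Hypothetical weighting: every patch can be masked, but covered
    # patches are up to (1 + bias) times more likely to be chosen.
    weights = (1.0 + bias * coverage).reshape(T, gh * gw)

    n_mask = int(round(mask_ratio * gh * gw))
    mask = np.zeros((T, gh * gw), dtype=bool)
    for t in range(T):
        probs = weights[t] / weights[t].sum()
        chosen = rng.choice(gh * gw, size=n_mask, replace=False, p=probs)
        mask[t, chosen] = True
    return mask.reshape(T, gh, gw)


# Toy clip: 4 frames of 64x64 pixels with one "hand" region per frame.
seg = np.zeros((4, 64, 64), dtype=np.uint8)
seg[:, 16:32, 16:32] = 1
mask = segmentation_guided_mask(seg, rng=np.random.default_rng(0))
```

In a mask-and-reconstruct setup, the encoder would see only the patches where `mask` is False and a decoder would be trained to reconstruct the rest. How the paper actually derives and applies its masks may differ from this sketch.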
