Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition
Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, and Yanyan Wei

TL;DR
This paper presents an ensemble learning approach using a Video Swin Transformer to improve cross-view sign language recognition, addressing viewpoint variability and achieving top competition rankings.
Contribution
It introduces a novel ensemble strategy with a multi-dimensional Video Swin Transformer for cross-view sign language recognition, enhancing robustness and generalization.
Findings
Achieved 3rd place in both the RGB and RGB-D tracks of the CV-ISLR challenge at WWW 2025.
Demonstrated improved cross-view recognition performance.
Validated effectiveness of ensemble learning in sign language recognition.
Abstract
In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language from different viewpoints, models must be capable of understanding gestures from multiple angles, making cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Finally, our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based…
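The abstract does not say how the individual model outputs are combined, so the following is only a minimal sketch of one common ensemble strategy: a weighted average of per-model class probabilities (late fusion). The function names, the three example models, and the logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Fuse several models' logits by weighted averaging of their softmax outputs.

    per_model_logits: list of logit lists, one per model, all of equal length.
    weights: optional per-model weights summing to 1; defaults to uniform.
    Returns (predicted_class_index, fused_probabilities).
    """
    n_models = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(per_model_logits[0])
    fused = [0.0] * n_classes
    for w, logits in zip(weights, per_model_logits):
        for c, p in enumerate(softmax(logits)):
            fused[c] += w * p
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused


# Hypothetical logits over three sign classes from three assumed models,
# e.g. two RGB backbones of different sizes and one depth-based model.
logits_rgb_small = [2.0, 1.0, 0.1]
logits_rgb_large = [0.5, 2.5, 0.3]
logits_depth = [0.2, 2.0, 0.1]

label, probs = ensemble_predict([logits_rgb_small, logits_rgb_large, logits_depth])
print(label, [round(p, 3) for p in probs])
```

In this toy example the first model prefers class 0 while the other two prefer class 1, and the averaged probabilities follow the majority. This illustrates the general robustness argument for ensembling; the weights or fusion rule actually used in the paper may differ.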
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Gait Recognition and Analysis
Methods: Attention Is All You Need · Label Smoothing · Layer Normalization · Stochastic Depth · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer
