Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition

Siyuan Jing; Guangxue Wang; Haoyang Zhai; Qin Tao; Jun Yang; Bing Wang; Peng Jin

arXiv:2506.06966·cs.CV·June 10, 2025

Open Access

TL;DR

This paper introduces a comprehensive dual-view Chinese sign language dataset and a CNN-Transformer hybrid network with a fusion strategy to improve isolated sign language recognition, addressing occlusion and vocabulary coverage challenges.

Contribution

It presents a new dual-view sign language dataset covering the full Chinese sign vocabulary and proposes a CNN-Transformer model with an effective fusion method for ISLR.

Findings

01. Fusion strategy improves recognition performance

02. Dual-view dataset covers complete Chinese sign vocabulary

03. Sequence-to-sequence models struggle to learn complementary features

Abstract

Due to the emergence of many sign language datasets, isolated sign language recognition (ISLR) has made significant progress in recent years. In addition, the development of various advanced deep neural networks is another reason for this breakthrough. However, challenges remain in applying the technique in the real world. First, existing sign language datasets do not cover the whole sign vocabulary. Second, most of the sign language datasets provide only single-view RGB videos, which makes it difficult to handle hand occlusions when performing ISLR. To fill this gap, this paper presents a dual-view sign language dataset for ISLR named NationalCSL-DP, which fully covers the Chinese national sign language vocabulary. The dataset consists of 134,140 sign videos recorded by ten signers with respect to two vertical views, namely, the front side and the left side. Furthermore, a CNN…

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Hearing Impairment and Communication