An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech
Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu

TL;DR
This paper introduces a three-stage model for hand shape recognition in Cued Speech that combines self-supervised contrastive learning, fine-tuning on a small amount of labeled data, and sequential feature learning. It reports high recognition accuracy and contributes a new dataset.
Contribution
It proposes a novel multi-stage approach integrating self-supervised contrastive learning and sequential modeling for improved CS hand shape feature extraction.
Findings
Achieved over 90% accuracy in hand shape recognition.
Improved CS phoneme recognition correctness by over 8-10%.
Built a new British English CS dataset with 5 native speakers.
Abstract
Cued Speech (CS) is a communication system for deaf or hearing-impaired people in which a speaker aids a lipreader at the phonetic level by clarifying potentially ambiguous mouth movements with hand shapes and positions. Feature extraction of multi-modal CS is a key step in CS recognition. Recent supervised deep learning based methods suffer from noisy CS data annotations, especially for the hand shape modality. In this work, we first propose a self-supervised contrastive learning method to learn image feature representations without using labels. Secondly, a small amount of manually annotated CS data is used to fine-tune the first module. Thirdly, we present a module that combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information. Besides, to enlarge the volume and the diversity of the current limited CS…
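The first stage described in the abstract learns image features without labels by contrastive learning. The abstract does not say which contrastive objective is used, so the sketch below assumes the NT-Xent loss popularized by SimCLR as one common choice. The function names `cosine` and `nt_xent_loss` and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over a batch (assumed objective, not confirmed by the paper).

    view_a[i] and view_b[i] are embeddings of two augmentations of the same
    hand image; every other embedding in the batch acts as a negative.
    """
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy embeddings: two images, two augmented views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]            # views aligned with view_a
view_b_shuffled = [[0.1, 1.0], [1.0, 0.1]]   # views paired with the wrong image

loss_aligned = nt_xent_loss(view_a, view_b)
loss_shuffled = nt_xent_loss(view_a, view_b_shuffled)
```

In this toy example the correctly paired views yield a lower loss than the mismatched ones, which is the signal a label-free encoder would be trained on before the fine-tuning and sequence-modeling stages.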
Taxonomy
Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Subtitles and Audiovisual Media
