An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech

Jianrong Wang; Nan Gu; Mei Yu; Xuewei Li; Qiang Fang; Li Liu

arXiv:2106.14016 · cs.MM · June 29, 2021

Open Access

TL;DR

This paper introduces a three-stage model combining self-supervised contrastive learning, fine-tuning with limited labeled data, and sequential feature learning for hand shape recognition in Cued Speech, achieving high accuracy and dataset expansion.

Contribution

It proposes a novel multi-stage approach integrating self-supervised contrastive learning and sequential modeling for improved CS hand shape feature extraction.

Findings

01. Achieved over 90% accuracy in hand shape recognition.

02. Improved CS phoneme recognition correctness by over 8-10%.

03. Built a new British English CS dataset with 5 native speakers.

Abstract

Cued Speech (CS) is a communication system for deaf or hearing-impaired people in which a speaker aids a lipreader at the phonetic level by clarifying potentially ambiguous mouth movements with hand shapes and positions. Feature extraction from multi-modal CS data is a key step in CS recognition. Recent supervised deep-learning methods suffer from noisy CS data annotations, especially for the hand shape modality. In this work, we first propose a self-supervised contrastive learning method that learns image feature representations without using labels. Second, a small amount of manually annotated CS data is used to fine-tune the first module. Third, we present a module that combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information. Besides, to enlarge the volume and the diversity of the current limited CS…
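To make the first stage more concrete, here is a minimal sketch of a contrastive objective over pairs of augmented views. The abstract does not say which loss the authors use, so the NT-Xent (SimCLR-style) formulation below is an assumption, and the names `cosine`, `nt_xent_loss`, and `temperature` are hypothetical. The embeddings are plain Python lists standing in for encoder outputs of hand-shape images.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent loss over N positive pairs (views_a[i], views_b[i]).

    Assumed loss, not confirmed by the abstract. Every other embedding
    in the batch is treated as a negative for a given anchor.
    """
    z = views_a + views_b            # 2N embeddings in one list
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)      # index of the other view of the same image
        # Scaled similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy embeddings: two images, two augmented views each.
aligned = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

In this toy example the loss is lower when the two views of the same image are close in embedding space than when the pairing is swapped. That is the property a label-free pretraining stage relies on before a small labeled set is used for fine-tuning.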

Taxonomy

Topics: Hand Gesture Recognition Systems · Hearing Impairment and Communication · Subtitles and Audiovisual Media