Cross-Modal Mutual Learning for Cued Speech Recognition

Lei Liu; Li Liu

arXiv:2212.01083 · cs.CV · February 28, 2023

TL;DR

This paper introduces a transformer-based cross-modal mutual learning framework for automatic Cued Speech Recognition, effectively handling asynchronous lip and hand gesture modalities, and establishes a large-scale multi-speaker dataset for Mandarin Chinese.

Contribution

It proposes a novel cross-modal mutual learning approach with a modality-invariant codebook, and creates the first large-scale multi-speaker dataset for Mandarin Chinese ACSR.

Findings

01

Our model outperforms state-of-the-art methods significantly.

02

The approach effectively synchronizes asynchronous multi-modal data.

03

Extensive experiments validate the model's superior performance across languages.

Abstract

Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communications, where the Cued Speech (CS) system utilizes lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often utilize direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities (i.e., lip, hand shape, and hand position) in CS may cause interference for feature concatenation. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to promote multi-modal interaction. Compared with vanilla self-attention, our model forces modality-specific information of different modalities to pass through a modality-invariant codebook, collating linguistic representations for tokens of each modality. Then the shared linguistic knowledge is used to re-synchronize…
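
The idea of routing each modality through a shared codebook can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the codebook is built or attended to, so the example assumes plain single-head scaled dot-product attention with no learned projections, and the names `codebook_attend`, `d_model`, and `n_codes`, as well as the random features, are hypothetical. It only shows that token streams of different lengths can be re-expressed in one common set of codebook entries.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def codebook_attend(tokens, codebook):
    """Re-express each token as a convex combination of shared codebook entries.

    Assumption: single-head scaled dot-product attention, no learned projections.
    tokens: (T, d) features of one modality; codebook: (K, d) shared entries.
    """
    d = tokens.shape[-1]
    scores = tokens @ codebook.T / np.sqrt(d)   # (T, K) token-to-code similarity
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ codebook, weights          # (T, d), (T, K)

rng = np.random.default_rng(0)
d_model, n_codes = 16, 8                        # hypothetical sizes
codebook = rng.normal(size=(n_codes, d_model))  # shared across modalities

# Placeholder features; the streams have different lengths to mimic asynchrony.
lip = rng.normal(size=(20, d_model))
hand = rng.normal(size=(25, d_model))

lip_shared, lip_w = codebook_attend(lip, codebook)
hand_shared, hand_w = codebook_attend(hand, codebook)
```

Because both outputs are mixtures of the same codebook rows, lip and hand tokens end up in one shared space regardless of their original timing, in contrast to direct concatenation, which requires frame-aligned features.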

Taxonomy

Topics: Hand Gesture Recognition Systems · Speech and Audio Processing · Hearing Impairment and Communication