Cross-Modal Mutual Learning for Cued Speech Recognition
Lei Liu, Li Liu

TL;DR
This paper introduces a transformer-based cross-modal mutual learning framework for automatic Cued Speech Recognition, effectively handling asynchronous lip and hand gesture modalities, and establishes a large-scale multi-speaker dataset for Mandarin Chinese.
Contribution
It proposes a novel cross-modal mutual learning approach with a modality-invariant codebook, and creates the first large-scale multi-speaker dataset for Mandarin Chinese ACSR.
Findings
The model significantly outperforms state-of-the-art methods.
The approach effectively synchronizes asynchronous multi-modal data.
Extensive experiments validate the model's superior performance across languages.
Abstract
Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communications, where the Cued Speech (CS) system utilizes lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often utilize direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities (i.e., lip, hand shape and hand position) in CS may cause interference for feature concatenation. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to promote multi-modal interaction. Compared with vanilla self-attention, our model forces modality-specific information of different modalities to pass through a modality-invariant codebook, collating linguistic representations for tokens of each modality. Then the shared linguistic knowledge is used to re-synchronize…
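The codebook idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention step with no learned projections, random features in place of real lip and hand encoders, and hypothetical names (`attend_to_codebook`, `codebook`, `streams`). It only shows how token sequences of different lengths can each be re-expressed through one shared set of codebook vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_to_codebook(tokens, codebook):
    """Re-express each token as a weighted mix of shared codebook entries.

    tokens:   (T, d) features from one modality (hypothetical encoder output)
    codebook: (K, d) entries shared by all modalities
    Returns the (T, d) codebook-based representation and the (T, K) weights.
    """
    d = tokens.shape[-1]
    scores = tokens @ codebook.T / np.sqrt(d)   # scaled dot-product scores
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ codebook, weights

rng = np.random.default_rng(0)
d, K = 16, 8                                    # assumed sizes, for illustration only
codebook = rng.normal(size=(K, d))

# Random stand-ins for the three CS streams; lengths differ on purpose.
streams = {
    "lip": rng.normal(size=(20, d)),
    "hand_shape": rng.normal(size=(12, d)),
    "hand_position": rng.normal(size=(12, d)),
}

shared = {}
attn = {}
for name, tokens in streams.items():
    shared[name], attn[name] = attend_to_codebook(tokens, codebook)
```

Because every stream is projected onto the same `codebook`, the outputs live in a common space regardless of sequence length. How the paper uses that shared representation to re-synchronize the streams is not covered by this sketch.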
Taxonomy
Topics: Hand Gesture Recognition Systems · Speech and Audio Processing · Hearing Impairment and Communication
