Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition

Lei Liu; Li Liu; Haizhou Li

arXiv:2401.17604·cs.CV·February 9, 2024·1 cites

Open Access

TL;DR

This paper introduces EcoCued, a novel, efficient multi-modal transformer with a token-importance mechanism for improved cued speech recognition, capturing global dependencies with reduced computation and enhanced accuracy.

Contribution

The paper proposes a new Token-Importance-Aware Attention mechanism and an economical fusion transformer, significantly improving recognition accuracy and efficiency in multi-modal cued speech recognition.

Findings

1. EcoCued achieves state-of-the-art results on CS datasets.

2. The TIAA mechanism effectively selects important tokens for better fusion.

3. EcoCued reduces computational cost compared to existing methods.
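
The findings above say that important tokens are selected before fusion and that this lowers computational cost. This page does not describe how TIAA scores or selects tokens, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a hypothetical importance proxy (each hand token's dot product with the mean lip query) and keeps the top-k hand tokens before running ordinary scaled dot-product cross-attention over that subset. All function names, the scoring rule, and the toy inputs are illustrative assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_important(queries, keys, k):
    """Return the indices of the k keys that score highest against the mean query.

    ASSUMPTION: this scoring rule is a hypothetical stand-in for a
    token-importance measure; it is not taken from the paper.
    """
    dim = len(queries[0])
    mean_q = [sum(q[d] for q in queries) / len(queries) for d in range(dim)]
    scores = [dot(mean_q, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore the original (temporal) order


def topk_cross_attention(queries, keys, values, k):
    """Scaled dot-product cross-attention restricted to the k selected keys."""
    keep = select_important(queries, keys, k)
    scale = math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    fused = []
    for q in queries:
        weights = softmax([dot(q, keys[i]) / scale for i in keep])
        fused.append(
            [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(out_dim)]
        )
    return fused, keep


# Toy inputs: two "lip" query tokens and four "hand" key/value tokens.
lip = [[1.0, 0.0], [0.8, 0.2]]
hand = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
fused, kept = topk_cross_attention(lip, hand, hand, k=2)
print(kept)
```

With n queries and m keys, the selection step costs on the order of m dot products, and attention then costs n·k instead of n·m, which is where a saving would come from when k is much smaller than m.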

Abstract

Cued Speech (CS) is a pure visual coding method used by hearing-impaired people that combines lip reading with several specific hand shapes to make the spoken language visible. Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text, which can help hearing-impaired people to communicate effectively. The visual information of CS contains lip reading and hand cueing, thus the fusion of them plays an important role in ACSR. However, most previous fusion methods struggle to capture the global dependency present in long sequence inputs of multi-modal CS data. As a result, these methods generally fail to learn the effective cross-modal relationships that contribute to the fusion. Recently, attention-based transformers have been a prevalent idea for capturing the global dependency over the long sequence in multi-modal fusion, but existing multi-modal fusion…

Taxonomy

Topics: Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam · Softmax · Dense Connections