Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition
Lei Liu, Li Liu, Haizhou Li

TL;DR
This paper introduces EcoCued, a novel, efficient multi-modal transformer with a token-importance mechanism for improved cued speech recognition, capturing global dependencies with reduced computation and enhanced accuracy.
Contribution
The paper proposes a new Token-Importance-Aware Attention (TIAA) mechanism and an economical fusion transformer, EcoCued, that together improve both recognition accuracy and efficiency in multi-modal cued speech recognition.
Findings
EcoCued achieves state-of-the-art results on Cued Speech (CS) datasets.
The TIAA mechanism effectively selects important tokens for better fusion.
EcoCued reduces computational cost compared to existing methods.
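The token-selection idea in these findings can be illustrated with a small sketch. This summary does not say how the paper scores token importance or how its attention is structured, so everything below is an assumption made for illustration: the function name `topk_token_attention`, the vector-norm importance score, and the top-k selection rule are hypothetical and are not the authors' TIAA. The sketch only shows why attending over k selected tokens instead of all n tokens shrinks the score matrix from n_q × n to n_q × k.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def topk_token_attention(queries, keys, values, importance, k):
    """Scaled dot-product attention restricted to the k highest-scoring tokens.

    queries: (n_q, d); keys and values: (n_kv, d); importance: (n_kv,).
    Hypothetical helper: the scoring and selection rules are assumptions.
    """
    k = min(k, keys.shape[0])
    # Indices of the k largest importance scores, kept in temporal order.
    keep = np.sort(np.argsort(importance)[-k:])
    k_sel, v_sel = keys[keep], values[keep]
    # Score matrix is (n_q, k) rather than (n_q, n_kv).
    scores = queries @ k_sel.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v_sel, keep


rng = np.random.default_rng(0)
lip = rng.normal(size=(6, 8))    # hypothetical lip-stream tokens
hand = rng.normal(size=(10, 8))  # hypothetical hand-stream tokens

# Placeholder importance score (assumption): the L2 norm of each token.
importance = np.linalg.norm(hand, axis=-1)

# Lip tokens attend over the 4 hand tokens with the highest scores.
fused, kept = topk_token_attention(lip, hand, hand, importance, k=4)
```

Here `fused` has one row per lip token, and `kept` lists the indices of the hand tokens that took part in the attention. A learned importance score would replace the norm-based placeholder in any realistic model.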
Abstract
Cued Speech (CS) is a purely visual coding method used by hearing-impaired people that combines lip reading with several specific hand shapes to make spoken language visible. Automatic CS recognition (ACSR) seeks to transcribe the visual cues of speech into text, which can help hearing-impaired people communicate effectively. The visual information in CS comprises lip reading and hand cueing, so fusing the two plays an important role in ACSR. However, most previous fusion methods struggle to capture the global dependency present in the long sequence inputs of multi-modal CS data. As a result, these methods generally fail to learn the effective cross-modal relationships that contribute to the fusion. Recently, attention-based transformers have become a prevalent approach for capturing global dependency over long sequences in multi-modal fusion, but existing multi-modal fusion…
Taxonomy
Topics: Speech and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam · Softmax · Dense Connections
