A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition
Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Willett, Shaul Druckmann

TL;DR
This paper presents MONA, a cross-modal silent speech recognition system enhanced by LLM scoring, achieving significant reductions in word error rate and demonstrating the viability of noninvasive silent speech interfaces as alternatives to traditional ASR.
Contribution
Introduces MONA, trained with two novel loss functions (crossCon and supTcon), and LLM Integrated Scoring Adjustment (LISA), enabling open-vocabulary silent speech recognition with state-of-the-art accuracy.
Findings
Reduced silent speech WER from 28.8% to 12.2% on the Gaddy (2020) open-vocabulary benchmark dataset.
Achieved 3.7% WER on vocal EMG recordings, down from the previous state of the art of 23.3%.
Performed best in the Brain-to-Text 2024 competition, with a top WER of 8.9%.
Abstract
Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our introduction of Large Language Model (LLM) Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset for silent speech on an open vocabulary. For vocal EMG recordings, our method improves the state-of-the-art from 23.3%…
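Every headline number above is a word error rate, so a minimal sketch of how WER is conventionally computed may help in reading them. It assumes the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is not the authors' evaluation code. The function names `word_error_rate` and `relative_reduction` are hypothetical, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction of the old error rate that the new system removes."""
    return (old_wer - new_wer) / old_wer


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_reduction(28.8, 12.2))
```

Under this definition, going from 28.8% to 12.2% WER is a relative reduction of roughly 58%.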
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
