MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction
Mohammed Salah Al-Radhi, Géza Németh, Branislav Gerazov

TL;DR
MiSTR is a novel deep-learning framework that synthesizes natural speech from intracranial EEG signals by combining wavelet features, Transformer-based prosody prediction, and neural phase reconstruction, advancing brain-computer speech interfaces.
Contribution
Introduces MiSTR, integrating wavelet features, Transformer-based prosody modeling, and neural phase reconstruction for improved iEEG-to-speech synthesis.
Findings
Achieves state-of-the-art speech intelligibility, with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms.
Outperforms existing neural speech synthesis baselines.
Demonstrates effective prosody and phase reconstruction from iEEG signals.
Abstract
Speech synthesis from intracranial EEG (iEEG) signals offers a promising avenue for restoring communication in individuals with severe speech impairments. However, achieving intelligible and natural speech remains challenging due to limitations in feature representation, prosody modeling, and phase reconstruction. We introduce MiSTR, a deep-learning framework that integrates: 1) Wavelet-based feature extraction to capture fine-grained temporal, spectral, and neurophysiological representations of iEEG signals, 2) A Transformer-based decoder for prosody-aware spectrogram prediction, and 3) A neural phase vocoder enforcing harmonic consistency via adaptive spectral correction. Evaluated on a public iEEG dataset, MiSTR achieves state-of-the-art speech intelligibility, with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms, improving over existing neural…
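The headline metric is a mean Pearson correlation between reconstructed and original Mel spectrograms. The sketch below shows one common way to compute such a score. It is an illustration, not the authors' evaluation code: the averaging scheme (one correlation per Mel bin over time, then a mean over bins) is an assumption, and the function names `pearson` and `mean_bin_correlation` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_bin_correlation(original, reconstructed):
    """Average the per-bin Pearson correlation over all Mel bins.

    Each spectrogram is a list of frames; each frame is a list of
    Mel-bin values. Assumption: correlation is taken over time for
    each bin separately, then averaged across bins.
    """
    if len(original) != len(reconstructed):
        raise ValueError("spectrograms must have the same number of frames")
    # Transpose from frames x bins to bins x frames.
    orig_bins = list(zip(*original))
    recon_bins = list(zip(*reconstructed))
    if len(orig_bins) != len(recon_bins):
        raise ValueError("spectrograms must have the same number of bins")
    scores = [pearson(o, r) for o, r in zip(orig_bins, recon_bins)]
    return sum(scores) / len(scores)


# Toy example: 3 frames, 2 Mel bins (values are made up for illustration).
original = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
# A reconstruction that is an affine rescaling of the original.
reconstructed = [[2.0 * v + 1.0 for v in frame] for frame in original]
score = mean_bin_correlation(original, reconstructed)
```

Because Pearson correlation is invariant to positive scaling and offset, the rescaled toy reconstruction scores close to 1.0. A high correlation therefore indicates matching spectro-temporal shape rather than matching absolute energy.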
