MAD: Multi-Alignment MEG-to-Text Decoding
Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, Hui Xiong

TL;DR
This paper introduces a novel end-to-end multi-alignment framework for decoding MEG brain signals into text. It improves performance on unseen text, raising BLEU-1 on the GWilliams dataset from a 5.49 baseline to 6.86, and points toward more capable brain-computer interfaces.
Contribution
The paper is the first to develop an end-to-end multi-alignment model for MEG-to-text translation, improving generalization to unseen text.
Findings
Achieved a BLEU-1 score of 6.86 on the GWilliams dataset
Improved BLEU-1 over the baseline from 5.49 to 6.86
Demonstrated potential for real-world BCI applications
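BLEU-1, the headline metric in these findings, is clipped unigram precision multiplied by a brevity penalty, conventionally reported on a 0-100 scale. The minimal sketch below computes sentence-level BLEU-1 for one hypothesis and one reference, assuming plain whitespace tokenization. It is only an illustration of the metric, not the paper's evaluation script, which may aggregate at corpus level or rely on a library such as NLTK or sacreBLEU.

```python
import math
from collections import Counter


def bleu1(hypothesis: str, reference: str) -> float:
    """Sentence-level BLEU-1 (clipped unigram precision x brevity penalty), scaled to 0-100.

    Assumes whitespace tokenization and a single reference sentence.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    # Clip each hypothesis word count by how often that word occurs in the reference.
    hyp_counts = Counter(hyp)
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = clipped / len(hyp)

    # Brevity penalty: hypotheses shorter than the reference are penalized.
    if len(hyp) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(hyp))

    return 100.0 * bp * precision


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    print(round(bleu1("the cat sat on the mat", ref), 2))  # 100.0
    print(round(bleu1("the the the the the the", ref), 2))  # 33.33
```

Clipping is what stops a decoder from scoring well by repeating frequent words, which matters when outputs drift toward high-frequency tokens.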
Abstract
Deciphering language from brain activity is a crucial task in brain-computer interface (BCI) research. Non-invasive brain-signal recording techniques, including electroencephalography (EEG) and magnetoencephalography (MEG), are becoming increasingly popular because they are safe and practical and avoid invasive electrode implantation. However, current work has three under-investigated shortcomings: 1) a predominant focus on EEG with limited exploration of MEG, which provides superior signal quality; 2) poor performance on unseen text, indicating the need for models that generalize better to diverse linguistic contexts; 3) insufficient integration of information from other modalities, which may limit our ability to fully understand the intricate dynamics of brain activity. This study presents a novel approach for translating MEG signals into text using a…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- **Clarity**: The paper is exceptionally well-written and easy to understand.
- **Novelty**: It introduces a novel multi-alignment approach that uses auxiliary modalities, such as Mel spectrograms, to enhance the translation of MEG signals into text. This work is notable for being one of the few that reports results without using teacher forcing.
- **Performance**: The method achieves state-of-the-art results on the BLEU-1 metric for this dataset.
- **Ablation Studies**: The paper includes…
- **Dataset**: The reliance on a single dataset reduces the impact of the results. Although the authors claim good performance on entirely unseen text, the word overlap is 46%, with test stories drawn from the same corpus and the same subjects. This claim should be validated on other datasets.
- **Losses**: The authors indicate that the L_e loss is the most important in the study, but they do not explain why. A part-of-speech analysis could clarify the performance differences between content an…
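The 46% word-overlap figure in the review above can be measured in more than one way. The sketch below computes two candidate definitions, the share of test-set word types seen in training and the share of test-set tokens whose word was seen in training. Which definition the reviewer used is not stated, so both definitions, the lowercasing, and the whitespace tokenization are assumptions.

```python
def word_overlap(train_texts: list[str], test_texts: list[str]) -> dict[str, float]:
    """Overlap between training and test vocabularies.

    Assumes lowercase whitespace tokenization; the reviewer's exact definition is unknown.
    """
    train_vocab = {w for text in train_texts for w in text.lower().split()}
    test_tokens = [w for text in test_texts for w in text.lower().split()]
    if not test_tokens:
        return {"type_overlap": 0.0, "token_overlap": 0.0}

    test_types = set(test_tokens)
    # Fraction of distinct test words that also occur in training.
    type_overlap = len(test_types & train_vocab) / len(test_types)
    # Fraction of running test words whose word occurs in training.
    token_overlap = sum(w in train_vocab for w in test_tokens) / len(test_tokens)
    return {"type_overlap": type_overlap, "token_overlap": token_overlap}


if __name__ == "__main__":
    print(word_overlap(["The cat sat on the mat"], ["The dog sat on the rug"]))
```

Token-level overlap is usually higher than type-level overlap because frequent function words recur, so the two numbers can tell rather different stories about how "unseen" a test set is.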
1. Incorporating a new modality into MEG2Text translation is desirable.
2. The high-level L_e loss is shown to be effective.
The residual connection in Figure 1(b) is missing an arrow. The main experimental results in Table 2 show no advantage over RS and NeuSpeech; almost all scores are lower, except for B-1 and self-B. The brain module within Wav2vecCTC is not trained on text, so it performs poorly and struggles to generate coherent words, which makes the comparison unfair. Baseline models should be compared both with and without teacher forcing. The paper aims to showcase the model's ability to generalize to unseen text, yet this a…
The paper can be classified as a combination of existing ideas. In terms of architecture, the authors follow the idea of Yang et al. (2024) and combine a convolutional network with a pre-trained Whisper speech decoder; instead of standard convolution layers, they propose to use the brain module of Défossez et al. (2023). As training objectives, they compare combinations of similar (and related) loss functions as proposed in Défossez et al. (2023) and Yang et al. (2024).
### Quality
I appreciate…
### Originality
- Brain activity (invasive and non-invasive) to speech/text translation has already been introduced in prior work, as summarized in Section 2 (related work).
- The methodological contribution is incremental. Défossez et al. (2023) introduced the idea of aligning M/EEG signals with latent representations of a pre-trained ASR model (wav2vec 2.0) using the CLIP loss. In the submission, the authors use a related approach (a Whisper encoder instead of wav2vec 2.0, and MMD loss instead o…
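The review above mentions the MMD loss used in the submission alongside the CLIP loss of Défossez et al. (2023). As a reference point, the NumPy sketch below computes the biased estimate of squared maximum mean discrepancy (MMD) between two batches of embeddings with a Gaussian RBF kernel. The kernel, the bandwidth `sigma`, and the framing of one batch as MEG-side embeddings are assumptions for illustration; this summary does not give the paper's exact formulation.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clamp tiny negatives caused by rounding
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    meg_emb = rng.normal(size=(64, 16))              # hypothetical MEG-side embeddings
    same = rng.normal(size=(64, 16))                 # drawn from the same distribution
    shifted = rng.normal(loc=2.0, size=(64, 16))     # drawn from a shifted distribution
    # Bandwidth of about sqrt(dim) keeps the kernel from saturating in 16 dimensions.
    print(mmd2(meg_emb, same, sigma=4.0) < mmd2(meg_emb, shifted, sigma=4.0))  # True
```

Unlike a CLIP-style contrastive loss, which pairs each sample with its own target, MMD compares the two batches as distributions, so it does not need per-sample negatives.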
The approach as a whole is novel, and the reported performance is strong across metrics for both high-level accuracy and low-level semantic content. The inclusion of random Gaussian baselines should be noted and adopted as standard for future decoding studies. The push for evaluation on unseen text is also laudable.
There appear to be some citation-formatting issues that should be fixed (see the ICLR 2025 style guide), as well as a few grammatical issues. Additionally, the benchmarking comparison with Défossez et al. (2023) appears to be in bad faith. Instead of comparing their decoding framework against Défossez et al.'s model as originally designed, the authors seem to use only the brain model and then apply a decoding head in the style of their own framework. Thus, it…
- This paper introduces MAD, a new open-vocabulary MEG-to-text translation framework that avoids the teacher-forcing problem seen in previous models and uses a more reasonable evaluation method.
- The experimental design is robust, including comprehensive ablation studies that underscore the advantages of multi-modal alignment.
- Explanations of the model architecture, data, and experimental setup are clear and well-structured.
- The model sets a new standard in MEG-to-text decoding accuracy, s…
The primary concern is the limited data: the experiments were conducted only on the GWilliams dataset, while the model was trained with MEG-speech-text modality pairs. Testing on a single dataset weakens the claims of robustness and generalizability.
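Several reviews above stress reporting results without teacher forcing. The distinction is whether the decoder predicts each word from the ground-truth prefix (teacher forcing) or from its own earlier outputs (free-running, as at deployment). The toy sketch below uses a hypothetical `toy_model` lookup table standing in for a trained decoder; it only shows how one early error can inflate the teacher-forced score relative to free-running decoding, and it is not the paper's model.

```python
from collections.abc import Callable

# Hypothetical stand-in for a trained decoder: maps a token prefix to the next token.
NextToken = Callable[[list[str]], str]


def teacher_forced(next_token: NextToken, target: list[str]) -> list[str]:
    """Predict each position from the ground-truth prefix target[:i]."""
    return [next_token(target[:i]) for i in range(len(target))]


def free_running(next_token: NextToken, length: int) -> list[str]:
    """Predict each position from the model's own previous outputs (greedy decoding)."""
    out: list[str] = []
    for _ in range(length):
        out.append(next_token(out))
    return out


def accuracy(pred: list[str], target: list[str]) -> float:
    """Fraction of positions where the prediction matches the target word."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


# Toy "model": knows the target's bigram continuations but gets the first word wrong.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "on", "on": "a", "a": "mat"}


def toy_model(prefix: list[str]) -> str:
    if not prefix:
        return "a"  # wrong first word; the target starts with "the"
    return BIGRAMS.get(prefix[-1], "<unk>")


if __name__ == "__main__":
    target = ["the", "cat", "sat", "on", "a", "mat"]
    print(round(accuracy(teacher_forced(toy_model, target), target), 2))  # 0.83
    print(accuracy(free_running(toy_model, len(target)), target))  # 0.0
```

Under teacher forcing the ground-truth prefix repairs the first mistake at every step, while free-running decoding conditions on that mistake and drifts, which is why reviewers ask for baselines to be reported both ways.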
