A multimodal LLM for the non-invasive decoding of spoken text from brain recordings

Youssef Hmamouche; Ismail Chihab; Lahoucine Kdouri; Amal El Fallah Seghrouchni

arXiv:2409.19710·q-bio.NC·October 1, 2024

Open Access

TL;DR

This paper introduces a multimodal large language model that decodes spoken text from non-invasive fMRI brain recordings, overcoming challenges like low resolution and signal noise, and demonstrating promising results in understanding brain activity related to speech.

Contribution

The paper presents a novel end-to-end multimodal LLM architecture that aligns brain activity embeddings with text, specifically designed for decoding spoken language from fMRI data.
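
This summary does not say how the alignment is implemented. One common way to align a non-text modality with an LLM is to project its embeddings into the model's token-embedding space and prepend them to the text token embeddings. The sketch below illustrates only that general idea under stated assumptions; it is not the authors' architecture, and every name and size in it (`project`, `build_prefix_sequence`, `D_IN`, `D_MODEL`) is hypothetical.

```python
import random


def project(features, weights):
    """Linear map from one fMRI feature vector to the text embedding size.

    `weights` is a list of rows; each row has the same length as `features`.
    """
    return [sum(f * w for f, w in zip(features, row)) for row in weights]


def build_prefix_sequence(brain_frames, text_embeddings, weights):
    """Project each brain frame and prepend the results to the text embeddings."""
    brain_tokens = [project(frame, weights) for frame in brain_frames]
    return brain_tokens + text_embeddings


# Hypothetical sizes, chosen only to keep the example small.
D_IN, D_MODEL = 6, 4
rng = random.Random(0)

# Stand-in for a learned projection matrix (D_MODEL rows, D_IN columns).
weights = [[rng.gauss(0, 0.1) for _ in range(D_IN)] for _ in range(D_MODEL)]

# Stand-ins for 3 fMRI frames and 5 text token embeddings.
brain_frames = [[rng.random() for _ in range(D_IN)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(D_MODEL)] for _ in range(5)]

sequence = build_prefix_sequence(brain_frames, text_embeddings, weights)
```

In a real system the projection would be trained jointly with, or on top of, a language model, and the resulting sequence would be passed to that model in place of plain token embeddings.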

Findings

01 Outperforms existing models in decoding accuracy

02 Captures more accurate semantic content

03 Effective despite low-resolution and noisy signals

Abstract

Brain-related research topics in artificial intelligence have recently gained popularity, particularly due to the expansion of what multimodal architectures can do, from computer vision to natural language processing. Our main goal in this work is to explore the possibilities and limitations of these architectures in spoken text decoding from non-invasive fMRI recordings. Unlike vision and textual data, fMRI data represent a complex modality due to the variety of brain scanners, which implies (i) the variety of the recorded signal formats, (ii) the low resolution and noise of the raw signals, and (iii) the scarcity of pretrained models that can be leveraged as foundation models for generative learning. These points make the problem of the non-invasive decoding of text from fMRI recordings very challenging. In this paper, we propose an end-to-end multimodal LLM for decoding spoken…

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · ALIGN