Aligning Brain Signals with Multimodal Speech and Vision Embeddings

Kateryna Shapovalenko; Quentin Auster

arXiv:2511.00065 · cs.LG · November 11, 2025

TL;DR

This paper investigates how layer-wise embeddings from wav2vec2 (speech) and CLIP (vision-language) align with EEG signals recorded during speech perception, with the aim of better understanding how the brain processes language as a multisensory experience.

Contribution

It compares individual-layer and combined (progressively concatenated or summed) embeddings from wav2vec2 and CLIP against EEG recorded during natural speech perception, using ridge regression and contrastive decoding.

Findings

01 Multimodal, layer-aware embeddings improve alignment with brain activity.

02 Combining embeddings from different layers enhances decoding accuracy.

03 Layer-specific embeddings reveal insights into neural language processing.

Abstract

When we hear the word "house", we don't just process sound; we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us…
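The three layer strategies named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code. Everything here is an assumption made for illustration: the function names, the synthetic data, the closed-form ridge solver, the choice to predict embeddings from EEG features, and the use of top-1 cosine retrieval as a simple stand-in for contrastive decoding.

```python
import numpy as np


def individual_layers(layers):
    """Strategy 1: each layer's embedding is used on its own."""
    return list(layers)


def progressive_concatenation(layers):
    """Strategy 2: feature set k concatenates layers 0..k along the feature axis."""
    return [np.concatenate(layers[: k + 1], axis=1) for k in range(len(layers))]


def progressive_summation(layers):
    """Strategy 3: feature set k is the element-wise sum of layers 0..k."""
    return [np.sum(layers[: k + 1], axis=0) for k in range(len(layers))]


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^(-1) X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def top1_retrieval_accuracy(pred, target):
    """Fraction of rows whose prediction is most cosine-similar to its own target.

    Assumption: a retrieval score used here as a stand-in for contrastive decoding.
    """
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    similarity = p @ t.T
    return float(np.mean(np.argmax(similarity, axis=1) == np.arange(len(pred))))


if __name__ == "__main__":
    # Synthetic stand-ins: 3 hypothetical layers of 8-dim embeddings and
    # 32 hypothetical EEG features per speech segment (not real data).
    rng = np.random.default_rng(0)
    n, n_layers, dim, n_channels = 300, 3, 8, 32
    layers = [rng.normal(size=(n, dim)) for _ in range(n_layers)]
    mixing = rng.normal(size=(n_layers * dim, n_channels))
    noise = 0.5 * rng.normal(size=(n, n_channels))
    eeg = np.concatenate(layers, axis=1) @ mixing + noise

    train, test = slice(0, 200), slice(200, n)
    strategies = {
        "individual": individual_layers(layers),
        "concatenation": progressive_concatenation(layers),
        "summation": progressive_summation(layers),
    }
    for name, feature_sets in strategies.items():
        for k, feats in enumerate(feature_sets):
            # Assumed direction: map EEG features to embedding features.
            W = ridge_fit(eeg[train], feats[train], alpha=10.0)
            acc = top1_retrieval_accuracy(eeg[test] @ W, feats[test])
            print(f"{name}, layer index {k}: top-1 retrieval = {acc:.2f}")
```

Progressive summation requires every layer to share the same dimensionality, while progressive concatenation grows the feature dimension with each added layer. With real wav2vec2 or CLIP hidden states, that difference would affect how strongly the ridge penalty needs to be tuned per feature set.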

Taxonomy

Topics: Multisensory perception and integration · Face Recognition and Perception · Neurobiology of Language and Bilingualism