Multi-modal brain encoding models for multi-modal stimuli

Subba Reddy Oota; Khushbu Pahwa; Mounika Marreddy; Maneesh Singh; Manish Gupta; Bapi S. Raju

arXiv:2505.20027 · q-bio.NC · May 27, 2025


TL;DR

This study evaluates how well multi-modal Transformer models predict brain activity while participants experience multi-modal stimuli, showing that these models capture complex neural responses and revealing the distinct contributions of different modalities.

Contribution

The study compares cross-modal and jointly pretrained multi-modal models on predicting fMRI responses, highlighting their effectiveness and the modality-specific contributions to neural encoding.
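Encoding comparisons of this kind are commonly set up as a linear map from a model's features to each voxel's fMRI response, scored by the correlation between predicted and held-out responses. The sketch below assumes closed-form ridge regression and voxelwise Pearson correlation; the function names (`fit_ridge`, `voxelwise_pearson`, `encoding_score`) and the synthetic data are hypothetical illustrations, not code from the linked repository.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    # Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y,
    # fit jointly for all voxels (one column of Y per voxel).
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    # Pearson correlation between observed and predicted responses, per voxel.
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / np.maximum(denom, 1e-12)


def encoding_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    # Fit on the training split, then score on held-out time points.
    W = fit_ridge(X_train, Y_train, alpha)
    return voxelwise_pearson(Y_test, X_test @ W)


# Hypothetical synthetic data: rows are time points, columns are features or voxels.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 200, 50, 16, 10
n = n_train + n_test
X_a = rng.standard_normal((n, n_feat))  # features from "model A", which drive Y
X_b = rng.standard_normal((n, n_feat))  # features from "model B", unrelated to Y
Y = X_a @ rng.standard_normal((n_feat, n_vox)) + 0.5 * rng.standard_normal((n, n_vox))

scores_a = encoding_score(X_a[:n_train], Y[:n_train], X_a[n_train:], Y[n_train:])
scores_b = encoding_score(X_b[:n_train], Y[:n_train], X_b[n_train:], Y[n_train:])
print(f"model A mean r = {scores_a.mean():.2f}, model B mean r = {scores_b.mean():.2f}")
```

In practice, the regularization strength would typically be chosen per voxel by cross-validation, and features would be time-lagged to account for the hemodynamic delay; both steps are omitted here to keep the sketch short.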

Findings

01

Multi-modal models improve alignment in language and visual brain regions.

02

Unimodal features alone do not fully explain multi-modal brain responses.

03

Both video and audio modalities contribute to how well multi-modal models predict brain activity.
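Finding 02 suggests that some brain-relevant signal in multi-modal representations is not explained by unimodal features. One common way to test a claim like this is a residual analysis: regress the multi-modal features on the unimodal features, keep the residuals, and check whether the residuals still predict brain responses. The sketch below illustrates that idea under this assumption only; it is not presented as the paper's exact procedure, and `residualize`, the other helpers, and the synthetic data are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    # Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    # Pearson correlation per voxel (column).
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / np.maximum(denom, 1e-12)


def encoding_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    W = fit_ridge(X_train, Y_train, alpha)
    return voxelwise_pearson(Y_test, X_test @ W)


def residualize(M_train, U_train, M_test, U_test, alpha=1.0):
    # Hypothetical helper: remove the part of multi-modal features M that is
    # linearly predictable from unimodal features U. The mapping B is fit on
    # the training split only and reused unchanged on the test split.
    B = fit_ridge(U_train, M_train, alpha)
    return M_train - U_train @ B, M_test - U_test @ B


# Hypothetical synthetic data (not from the paper).
rng = np.random.default_rng(0)
n, n_train, d, n_vox = 250, 200, 8, 10
U = rng.standard_normal((n, d))  # unimodal features
E = rng.standard_normal((n, d))  # information carried only by the multi-modal model
M = U @ rng.standard_normal((d, d)) + E  # multi-modal features mix both
noise = 0.5 * rng.standard_normal((n, n_vox))
Y_unique = E @ rng.standard_normal((d, n_vox)) + noise  # responses driven by E
Y_shared = U @ rng.standard_normal((d, n_vox)) + noise  # responses driven by U only

tr, te = slice(0, n_train), slice(n_train, n)
R_train, R_test = residualize(M[tr], U[tr], M[te], U[te])

resid_unique = encoding_score(R_train, Y_unique[tr], R_test, Y_unique[te]).mean()
resid_shared = encoding_score(R_train, Y_shared[tr], R_test, Y_shared[te]).mean()
print(f"residual features: r(unique) = {resid_unique:.2f}, r(shared) = {resid_shared:.2f}")
```

In this synthetic setup, the residuals should still predict responses driven by the multi-modal-only signal `E`, while their predictions of responses driven purely by `U` should fall to roughly chance level.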

Abstract

Even when participants engage with unimodal stimuli, such as watching images or silent videos, recent work has demonstrated that multi-modal Transformer models can predict visual brain activity impressively well, even with incongruent modality representations. This raises the question of how accurately these multi-modal models can predict brain activity when participants engage with multi-modal stimuli. As these models grow increasingly popular, their use in studying neural activity provides insights into how our brains respond to such multi-modal naturalistic stimuli, i.e., where the brain separates and integrates information across modalities through a hierarchy from early sensory regions to higher cognition. We investigate this question by using multiple unimodal models and two types of multi-modal models (cross-modal and jointly pretrained) to determine which type of model is more relevant to fMRI…

Code & Models

Repositories

subbareddy248/multi-modal-brain-stimuli
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing