Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding

Hamid Abdollahi; Amir Hossein Mansouri Majoumerd; Amir Hossein Bagheri Baboukani; Amir Abolfazl Suratgar; Mohammad Bagher Menhaj

arXiv:2507.19052 · cs.CV · July 28, 2025

TL;DR

This study evaluates brain encoding models built on visual (X-CLIP) and auditory (Whisper) features, on both in-distribution and out-of-distribution data. It reveals a trade-off between model complexity and generalization and highlights the dominance of audiovisual streams in neural encoding.

Contribution

It introduces a rigorous evaluation of multimodal brain encoding models on out-of-distribution data, emphasizing the importance of model simplicity for robustness and the dominance of audiovisual information.

Findings

01

A simpler linear model outperforms a competitive baseline by 18% on OOD data, while a higher-capacity attention-based model performs best on ID data.

02

Linguistic features do not enhance predictive accuracy for familiar languages.

03

High-fidelity speech representations improve auditory cortex predictions.

Abstract

Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a critical, often untested, question. In this work, we developed brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and rigorously evaluated them on both in-distribution (ID) and diverse out-of-distribution (OOD) data. Our results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model excelled on ID data, but a simpler linear model was more robust, outperforming a competitive baseline by 18% on the OOD set. Intriguingly, we found that linguistic features did not improve predictive accuracy, suggesting that for familiar languages, neural encoding…
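
To make the evaluation setup in the abstract concrete, here is a minimal sketch of a linear encoding model scored separately on ID and OOD data. It is an illustration under stated assumptions, not the authors' pipeline: the feature matrices are random stand-ins for X-CLIP and Whisper embeddings, the brain responses are simulated, ridge regression is an assumed choice of "simpler linear model", the OOD split is a toy mean shift in feature space, and every name (`fit_ridge`, `voxelwise_pearson`, `make_features`, `simulate_response`) is hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation computed independently for each output column."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Guard against division by zero for constant columns.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


# --- Synthetic stand-ins (assumptions, not real data) ---
rng = np.random.default_rng(0)
n_train, n_test = 400, 100
n_visual, n_audio, n_parcels = 16, 8, 5


def make_features(n, shift=0.0):
    """Concatenate placeholder visual and auditory feature blocks."""
    visual = rng.normal(loc=shift, size=(n, n_visual))
    audio = rng.normal(loc=shift, size=(n, n_audio))
    return np.hstack([visual, audio])


# Simulated ground-truth mapping from features to brain responses.
true_W = rng.normal(size=(n_visual + n_audio, n_parcels))


def simulate_response(X):
    """Linear response plus Gaussian noise, one column per brain parcel."""
    noise = rng.normal(scale=2.0, size=(X.shape[0], n_parcels))
    return X @ true_W + noise


X_train = make_features(n_train)
Y_train = simulate_response(X_train)

X_id = make_features(n_test)               # same distribution as training
Y_id = simulate_response(X_id)

X_ood = make_features(n_test, shift=0.5)   # toy distribution shift
Y_ood = simulate_response(X_ood)

# Fit on training data, then score each held-out set per parcel.
W = fit_ridge(X_train, Y_train, alpha=10.0)
id_scores = voxelwise_pearson(Y_id, X_id @ W)
ood_scores = voxelwise_pearson(Y_ood, X_ood @ W)

print("mean ID correlation: ", id_scores.mean())
print("mean OOD correlation:", ood_scores.mean())
```

Because the simulated responses are exactly linear in the features, this toy setup will not reproduce the complexity versus generalization trade-off reported in the abstract. It only shows the mechanics of fitting once and reporting ID and OOD scores side by side.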
