A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
Qianyi He, Yuan Chang Leong

TL;DR
This paper introduces a multimodal sequence-to-sequence Transformer model that predicts brain responses to naturalistic stimuli by integrating visual, auditory, and language features, capturing long-range temporal dependencies and individual variability.
Contribution
It presents a novel multimodal Transformer architecture with dual cross-attention and a shared encoder, improving brain response prediction across subjects and stimulus types.
Findings
Achieved strong performance on in-distribution data
Performed well on out-of-distribution data
Effectively modeled long-range temporal dependencies
Abstract
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus and narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural…
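To make the decoder description above more concrete, here is a minimal NumPy sketch of one decoder block with two cross-attention paths. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tensor sizes, the single attention head, the absence of learned query/key/value projections, and the choice to sum the two cross-attention outputs are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention.

    Returns the attended values and the attention weights.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Positions where mask is False receive (effectively) zero weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def dual_cross_attention_block(brain, perceptual, narrative):
    """Hypothetical decoder block (assumed layout, not from the paper).

    brain:      (T, d) embeddings of prior brain states
    perceptual: (S, d) stimulus features (e.g. video/audio/text embeddings)
    narrative:  (N, d) embeddings of high-level content summaries
    """
    T = brain.shape[0]
    # Causal mask: time step t may only attend to brain states at steps <= t.
    causal = np.tril(np.ones((T, T), dtype=bool))
    self_out, self_weights = attention(brain, brain, brain, mask=causal)
    h = brain + self_out  # residual connection

    # Two separate cross-attention paths over the two context streams.
    perceptual_out, _ = attention(h, perceptual, perceptual)
    narrative_out, _ = attention(h, narrative, narrative)

    # Assumption: the two context streams are combined by summation.
    return h + perceptual_out + narrative_out, self_weights

rng = np.random.default_rng(0)
d_model, n_parcels = 16, 10  # hypothetical sizes for illustration only
brain = rng.normal(size=(5, d_model))       # 5 prior time steps
perceptual = rng.normal(size=(8, d_model))  # 8 stimulus feature frames
narrative = rng.normal(size=(3, d_model))   # 3 summary embeddings
W_out = rng.normal(size=(d_model, n_parcels)) * 0.1  # stand-in output head

hidden, self_weights = dual_cross_attention_block(brain, perceptual, narrative)
predicted = hidden @ W_out  # one predicted response vector per time step
print(predicted.shape)
```

In this sketch the causal mask is what makes the prediction autoregressive: each time step sees only earlier brain states, while both cross-attention paths can draw on the full stimulus and summary context.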
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Functional Brain Connectivity Studies · Emotion and Mood Recognition
