Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

Subba Reddy Oota; Khushbu Pahwa; Prachi Jindal; Satya Sai Srinath Namburi; Maneesh Singh; Tanmoy Chakraborty; Bapi S. Raju; Manish Gupta

arXiv:2506.08277·q-bio.NC·May 21, 2026

TL;DR

This study investigates how well instruction-tuned multimodal large language models (MLLMs) align with brain activity recorded under naturalistic stimuli, revealing that instruction tuning enhances brain-model alignment and task-specific representations.

Contribution

It provides the first comprehensive analysis of instruction-tuned MLLMs' brain alignment across multiple modalities and tasks, highlighting the impact of instruction tuning on neural representation organization.

Findings

01

Instruction-tuned MLLMs show roughly 15% higher brain alignment than non-instruction-tuned models.

02

Task-specific MLLM representations vary across brain regions and are associated with higher brain alignment.

03

In-context learning (ICL) models exhibit strong semantic organization, while instruction-tuned (IT) models show weak coupling to instruction semantics.

Abstract

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across…
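The encoding setup described in the abstract (predicting fMRI responses from model representations, one voxel at a time) can be illustrated with a minimal sketch. The abstract excerpt does not name the regression model or the alignment metric, so the ridge regression, the Pearson-correlation score, the contiguous train/test split, and all function names below are assumptions chosen for illustration; the data are synthetic stand-ins, not the paper's embeddings or recordings.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge weights: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Guard against constant columns to avoid division by zero.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


def brain_alignment(X, Y, train_frac=0.8, alpha=1.0):
    """Fit on the first part of the time series, score on the held-out rest.

    X: (n_timepoints, n_features) stimulus embeddings (hypothetical input).
    Y: (n_timepoints, n_voxels) fMRI responses (hypothetical input).
    Returns one correlation per voxel.
    """
    split = int(train_frac * X.shape[0])
    X_train, X_test = X[:split], X[split:]
    Y_train, Y_test = Y[:split], Y[split:]

    # Standardize features with training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    X_train = (X_train - mu) / sd
    X_test = (X_test - mu) / sd

    # Center responses so the model needs no explicit intercept.
    y_mu = Y_train.mean(axis=0)
    W = fit_ridge(X_train, Y_train - y_mu, alpha)
    Y_pred = X_test @ W + y_mu
    return voxelwise_pearson(Y_test, Y_pred)


# Synthetic example: responses are a noisy linear function of the features.
rng = np.random.default_rng(0)
n_timepoints, n_features, n_voxels = 400, 16, 5
X = rng.standard_normal((n_timepoints, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.5 * rng.standard_normal((n_timepoints, n_voxels))

scores = brain_alignment(X, Y)
print(scores.shape, float(scores.mean()))
```

In a real analysis the regularization strength would typically be chosen by cross-validation, and the features would be aligned to the fMRI sampling rate and hemodynamic delay before fitting; both steps are omitted here to keep the sketch short.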

Code & Models

Repositories

subbareddy248/mllm_videos (GitHub)

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Multimodal Machine Learning Applications · Action Observation and Synchronization

Methods: ALIGN