Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)

Subba Reddy Oota; Akshett Jindal; Ishani Mondal; Khushbu Pahwa; Satya Sai Srinath Namburi; Manish Shrivastava; Maneesh Singh; Bapi S. Raju; Manish Gupta

arXiv:2505.20029·q-bio.NC·May 27, 2025

TL;DR

This study investigates how instruction-tuned multimodal language models align with brain activity during visual tasks, revealing that they outperform vision-only models and encode instruction-specific visual concepts, advancing understanding of brain-model correspondence.

Contribution

The paper demonstrates that instruction-tuned multimodal models show improved brain alignment and encode instruction-specific visual concepts, highlighting their potential for modeling neural responses.

Findings

01 · MLLMs outperform vision-only models in brain alignment

02 · MLLMs encode instruction-specific visual concepts

03 · Shared variance between different instruction types in brain encoding
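The third finding concerns variance shared between instruction types. One common way to estimate shared versus unique explained variance is variance partitioning over two feature spaces; the sketch below illustrates that idea on synthetic data. The ridge regressor, the `alpha` value, the function names, and the data are all assumptions for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def ridge_r2(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Held-out R^2 per voxel for a closed-form ridge model (assumed regressor)."""
    x_mu, y_mu = X_train.mean(axis=0), Y_train.mean(axis=0)
    Xc = X_train - x_mu
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    W = np.linalg.solve(gram, Xc.T @ (Y_train - y_mu))
    pred = (X_test - x_mu) @ W + y_mu
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def partition_variance(A_tr, B_tr, Y_tr, A_te, B_te, Y_te, alpha=1.0):
    """Split joint-model R^2 into shared and unique parts for feature spaces A and B."""
    r2_a = ridge_r2(A_tr, Y_tr, A_te, Y_te, alpha)
    r2_b = ridge_r2(B_tr, Y_tr, B_te, Y_te, alpha)
    r2_ab = ridge_r2(np.hstack([A_tr, B_tr]), Y_tr, np.hstack([A_te, B_te]), Y_te, alpha)
    return {
        "shared": r2_a + r2_b - r2_ab,  # variance explained by either space
        "unique_a": r2_ab - r2_b,       # variance only A explains
        "unique_b": r2_ab - r2_a,       # variance only B explains
        "joint": r2_ab,
    }

# Synthetic stand-in: two "instruction" feature spaces with a common component.
rng = np.random.default_rng(0)
n_train, n_test, n_vox = 300, 200, 4
n = n_train + n_test
common = rng.standard_normal((n, 3))   # present in both A and B
only_a = rng.standard_normal((n, 3))   # present only in A
only_b = rng.standard_normal((n, 3))   # present only in B, unrelated to Y
A = np.hstack([common, only_a])
B = np.hstack([common, only_b])
Y = common @ np.ones((3, n_vox)) + only_a @ np.ones((3, n_vox))
Y = Y + 0.5 * rng.standard_normal((n, n_vox))

parts = partition_variance(A[:n_train], B[:n_train], Y[:n_train],
                           A[n_train:], B[n_train:], Y[n_train:])
for name, value in parts.items():
    print(name, value.round(2))
```

In this toy setup the responses depend on the common component and on the A-only component, so the shared and A-unique terms should be clearly positive while the B-unique term stays near zero.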

Abstract

Transformer-based language models, though not explicitly trained to mimic brain recordings, have demonstrated surprising alignment with brain activity. Progress in these models (through increased size, instruction-tuning, and multimodality) has led to better representational alignment with neural data. Recently, a new class of instruction-tuned multimodal LLMs (MLLMs) has emerged, showing remarkable zero-shot capabilities in open-ended multimodal vision tasks. However, it is unknown whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations. To address this, we first investigate brain alignment, i.e., measuring how well neural visual activity can be predicted from the text output response embeddings of MLLMs as participants watch natural scenes. Experiments with 10 different…
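Brain alignment in the sense described above (predicting neural activity from model embeddings) is often measured with a linear encoding model scored on held-out stimuli. The following is a minimal sketch under stated assumptions: the excerpt does not say which regressor or metric the authors use, so closed-form ridge regression and per-voxel Pearson correlation are assumed here, and the data are synthetic stand-ins for MLLM embeddings and fMRI responses.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights: W = (X^T X + alpha I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def pearson_per_voxel(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.where(denom == 0, np.nan, denom)

def brain_alignment(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit on training stimuli, score predictions on held-out stimuli."""
    # Standardize features with training statistics only, to avoid leakage.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    y_mu = Y_train.mean(axis=0)
    W = ridge_fit((X_train - mu) / sd, Y_train - y_mu, alpha)
    Y_pred = ((X_test - mu) / sd) @ W + y_mu
    return pearson_per_voxel(Y_test, Y_pred)

# Synthetic stand-in data (hypothetical sizes, not from the paper).
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 200, 50, 16, 5
X = rng.standard_normal((n_train + n_test, n_feat))   # "embeddings", one row per image
W_true = rng.standard_normal((n_feat, n_vox))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_vox))  # "voxel responses"

scores = brain_alignment(X[:n_train], Y[:n_train], X[n_train:], Y[n_train:])
print(scores.round(2))
```

In practice the regularization strength would usually be chosen by cross-validation per voxel, and scores are often normalized by a noise ceiling; both are omitted here to keep the sketch short.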

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* Novelty with instruction-tuned MLLMs: Overall, I haven’t seen too much work on brain encoding models with instruction-tuned MLLMs. This work is highly timely.
* Well-designed and controlled experiment: The paper uses a controlled and well designed experiment for exploring brain fits.
* Instruction breakdown: I really appreciated Figure 3 and 4. I thought it was pretty interesting to see how different instruction types fit responses in the visual cortex. This was a fairly novel result.
* I als

Weaknesses

* Comparisons: Overall, I think it would be better to include a baseline with a non-instruction-tuned MLLM that has the same architecture. For example, maybe BLIP-2 instead of InstructBLIP? This would have really explored the role of instruction tuning more thoroughly in comparison. BLIP-2 should be available off the shelf and I believe this would be a salient comparison. I would also be curious about how a language model of similar size would do.
* Another concern here is that improvement over

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The paper introduces a novel approach to evaluating the brain alignment of instruction-tuned MLLMs, providing valuable insights into how these models process multimodal information in relation to human brain activity.
- The findings have implications for understanding how brain activity corresponds to the processing of multimodal information, which could be valuable for cognitive neuroscience and AI research.

Weaknesses

- The study relies on the NSD dataset, where subjects passively view images, which may not fully capture brain activity aligned with task-specific instructions. Active task engagement during fMRI scans could provide a more comprehensive evaluation.
- How do the authors address ethical considerations regarding the use of fMRI data, especially in relation to participant privacy and data security?

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Instruction-tuned multimodal models are an interesting way to investigate task tuning, as they have higher performance than prior fine-tuned models like the Taskonomy set.
- The comparison of retinotopic versus category-selective tuning is interesting, and may yield novel insights into high-level vision.
- Variance partitioning allows a more fine-grained look into the contribution of different task tuning.

Weaknesses

- While the instruction model-brain comparisons are very interesting, it is not entirely clear what is at stake. Is the central question spelled out in the intro (lines 83-85) important for better human-alignment of AI, or in order to reveal tuning properties of the human brain? If the latter, the authors should flesh this out, and state some limits of this model comparison approach (see below). The primary goal of the study should be clarified in the introduction.
- The authors make a distincti

Code & Models

Repositories

subbareddy248/mllm_instruction_brain · PyTorch · Official

Taxonomy

Topics: Categorization, perception, and language · Language, Metaphor, and Cognition

Methods: Contrastive Language-Image Pre-training