Vision-Language Models Align with Human Neural Representations in Concept Processing
Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

TL;DR
This study evaluates how well different vision-language models (VLMs) align with human fMRI responses during concept processing. VLMs align better with the brain than language-only models, and vision-language encoders align better than generative VLMs.
Contribution
It systematically compares various VLM architectures and their alignment with neural data, highlighting the impact of training and model design on brain similarity.
Findings
VLMs outperform language-only models in brain alignment.
Some models, such as LXMERT and IDEFICS2, learn more human-like concepts.
Vision-language encoders are more brain-aligned than generative VLMs.
Abstract
Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role played by visual and textual context is still lacking. Here, we analyse multiple VLMs employing different strategies to integrate visual and textual modalities, along with language-only counterparts. We measure the alignment between concept representations by models and existing (fMRI) brain responses to concept words presented in two experimental conditions, where either visual (pictures) or textual (sentences) context is provided. Our results reveal that VLMs outperform the language-only counterparts in both experimental conditions. However, controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, brain alignment stems…
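The abstract excerpt above does not name the alignment metric. As a minimal sketch, assuming representational similarity analysis (RSA), one common way to compare model embeddings with fMRI responses, alignment could be computed as follows. The function names, array shapes, and synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson correlation between rows.

    `features` has shape (n_concepts, n_dimensions), one row per concept word.
    """
    return 1.0 - np.corrcoef(features)

def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries as a flat vector."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]

def ranks(values):
    """Ordinal ranks; ties are broken by position, which is fine for continuous data."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def rsa_alignment(model_features, brain_responses):
    """Spearman-style correlation between the model RDM and the brain RDM."""
    m = upper_triangle(rdm(model_features))
    b = upper_triangle(rdm(brain_responses))
    return float(np.corrcoef(ranks(m), ranks(b))[0, 1])

# Synthetic stand-ins (assumption): 20 concepts, 64-d model embeddings,
# 200 voxels. Real data would come from a VLM and an fMRI dataset.
rng = np.random.default_rng(0)
n_concepts = 20
latent = rng.normal(size=(n_concepts, 5))
model_features = latent @ rng.normal(size=(5, 64))
brain_responses = latent @ rng.normal(size=(5, 200))
brain_responses += 0.1 * rng.normal(size=brain_responses.shape)

score = rsa_alignment(model_features, brain_responses)
print(f"RSA alignment: {score:.3f}")
```

Under this sketch, comparing a VLM with its language-only counterpart would amount to calling rsa_alignment once per model on the same brain responses and comparing the two scores.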
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies
Methods: Focus · ALIGN
