TL;DR
This study tests whether multimodal neural networks such as CLIP explain hippocampal multivoxel fMRI activity better than unimodal (purely visual or purely linguistic) models, and finds that multimodality is a key factor.
Contribution
It demonstrates that multimodal models outperform unimodal ones in explaining hippocampal activity, emphasizing the role of multimodality in neural encoding.
Findings
Multimodal models better explain hippocampal activity than unimodal models.
Multimodality is a key factor in neural representation of concepts.
CLIP model shows strong alignment with hippocampal multivoxel patterns.
Abstract
The human hippocampus possesses "concept cells", neurons that fire when presented with stimuli belonging to a specific concept, regardless of the modality. Recently, similar concept cells were discovered in a multimodal network called CLIP (Radford et al., 2021). Here, we ask whether CLIP can explain the fMRI activity of the human hippocampus better than a purely visual (or linguistic) model. We extend our analysis to a range of publicly available uni- and multi-modal models. We demonstrate that "multimodality" stands out as a key component when assessing the ability of a network to explain the multivoxel activity in the hippocampus.
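The abstract does not say how "explaining multivoxel activity" is measured. One common way to compare a network's representations with multivoxel fMRI patterns is representational similarity analysis (RSA). The sketch below assumes RSA purely for illustration and is not the authors' pipeline. All arrays are synthetic stand-ins, and the names `rdm`, `rsa_score`, `model_a`, and `model_b` are hypothetical.

```python
import numpy as np

def rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson r between rows (stimuli)."""
    return 1.0 - np.corrcoef(patterns)

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries as a flat vector."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]

def rank(values):
    """Ordinal ranks (ties are not averaged; adequate for continuous data)."""
    return np.argsort(np.argsort(values)).astype(float)

def rsa_score(model_features, voxel_patterns):
    """Rank correlation between the model RDM and the brain RDM."""
    a = rank(upper_triangle(rdm(model_features)))
    b = rank(upper_triangle(rdm(voxel_patterns)))
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-ins (assumption): rows are stimuli, columns are
# voxels or embedding dimensions. Real data would replace these.
rng = np.random.default_rng(0)
n_stimuli = 40
voxels = rng.normal(size=(n_stimuli, 200))   # hippocampal voxel patterns
model_a = rng.normal(size=(n_stimuli, 512))  # e.g. multimodal embeddings
model_b = rng.normal(size=(n_stimuli, 512))  # e.g. unimodal embeddings

score_a = rsa_score(model_a, voxels)
score_b = rsa_score(model_b, voxels)
print(f"model_a: {score_a:.3f}  model_b: {score_b:.3f}")
```

With real data, the model whose score is reliably higher across subjects would be the one whose representational geometry better matches the hippocampal patterns. With the random arrays used here, both scores are expected to be near zero.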
Taxonomy
Methods: Contrastive Language-Image Pre-training
