Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms
Shreya Saha, Ishaan Chadha, Meenakshi Khosla

TL;DR
This study systematically compares various neural network models and readout mechanisms to better understand and predict human visual responses, revealing region-specific model preferences and proposing a novel readout scheme that improves accuracy.
Contribution
It provides a comprehensive comparison of response-optimized, task-optimized, and language model-based visual models, introducing a new readout mechanism that enhances neural response prediction.
Findings
Response-optimized models excel in early to mid-level visual areas.
LLM embeddings and task-optimized models perform best in higher visual regions.
A novel readout scheme improves prediction accuracy by 3-23%.
Abstract
Over the past decade, predictive modeling of neural responses in the primate visual system has advanced significantly, largely driven by various DNN approaches. These include models optimized directly for visual recognition, cross-modal alignment through contrastive objectives, neural response prediction from scratch, and large language model embeddings. Likewise, different readout mechanisms, ranging from fully linear to spatial-feature factorized methods, have been explored for mapping network activations to neural responses. Despite the diversity of these approaches, it remains unclear which method performs best across different visual regions. In this study, we systematically compare these approaches for modeling the human visual system and investigate alternative strategies to improve response predictions. Our findings reveal that for early to mid-level visual areas,…
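To make the contrast between the two ends of the readout spectrum named in the abstract concrete, here is a minimal sketch for a single voxel. It is not the paper's implementation: the function names, the toy 2-channel 2x2 feature map, and the 512x14x14 shape used for the parameter count are all assumptions chosen for illustration. A fully linear readout learns one weight per feature-map entry, while a spatial-feature factorized readout learns one spatial mask plus one weight per channel.

```python
# Toy comparison of two readouts for one voxel (pure Python, no dependencies).
# All names, shapes, and values here are illustrative assumptions.

def linear_readout(features, weights, bias=0.0):
    """Fully linear readout: one weight per (channel, row, col) entry."""
    total = bias
    for f_ch, w_ch in zip(features, weights):
        for f_row, w_row in zip(f_ch, w_ch):
            for f, w in zip(f_row, w_row):
                total += f * w
    return total

def factorized_readout(features, spatial_mask, channel_weights, bias=0.0):
    """Spatial-feature factorized readout: pool every channel with one shared
    spatial mask, then mix the pooled channels with per-channel weights."""
    total = bias
    for f_ch, w_c in zip(features, channel_weights):
        pooled = sum(f * s
                     for f_row, s_row in zip(f_ch, spatial_mask)
                     for f, s in zip(f_row, s_row))
        total += w_c * pooled
    return total

def param_counts(channels, height, width):
    """Weights per voxel (bias excluded) for each readout."""
    return {"linear": channels * height * width,
            "factorized": height * width + channels}

# A 2-channel, 2x2 feature map (channels x rows x cols).
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.5, 0.5], [0.5, 0.5]]]
spatial_mask = [[1.0, 0.0], [0.0, 1.0]]
channel_weights = [2.0, -1.0]

# The factorized readout is a rank-1 special case of the linear one:
# weights[c][h][w] = channel_weights[c] * spatial_mask[h][w].
equivalent_weights = [[[w_c * s for s in row] for row in spatial_mask]
                      for w_c in channel_weights]

factorized = factorized_readout(features, spatial_mask, channel_weights)
linear = linear_readout(features, equivalent_weights)
counts = param_counts(512, 14, 14)
```

With the assumed 512x14x14 feature map, the linear readout needs 100,352 weights per voxel against 708 for the factorized one, which is the usual motivation for factorizing when voxels are many and training stimuli are few.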
Peer Reviews
Decision·Submitted to ICLR 2025
The use of deep neural network models to predict and understand the structure of representation in the biological visual system is a practice rife with heretofore unanswered, but deeply foundational questions as to how it should be done. Bucking a trend that far too often recycles canonical, but relatively unscrutinized methods to new models or new brain data, this submission is impressive not just for the fact that it tackles these questions head-on, but tackles so many of them simultaneously -
My major concern here (and one that I admit is not fully within the authors' control, but which clarifying updates or different narrative focus could nonetheless address) is the lingering doubt as to whether even these newer, more expertly designed methods actually do give us any meaningful new “insights” about the biological system they’re nominally designed to give us insights about. An overly reductionist summary of the “findings” of this analysis with respect to the human visual brain could w
I think overall, the authors' thorough experimentation is the greatest strength of this paper:
* **Systematic Comparison:** They do a reasonably systematic comparison, comparing a diverse range of models and readout mechanisms, which offers valuable insights.
* **Novel Readout Mechanism:** They propose (in the context of fMRI encoders) a novel readout mechanism, using the previously proposed spatial transformer with differentiable bilinear sampling, and show that it indeed improves predictio
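For readers unfamiliar with the "differentiable bilinear sampling" mentioned in this review, the following is a minimal sketch of the sampling step alone, not the authors' readout. The function names, the pixel-coordinate convention, and the border clamping are assumptions made for illustration. The sampled value is a weighted average of the four nearest grid entries, so it varies piecewise linearly with the location `(x, y)`; that is what lets a learned location receive gradients.

```python
import math

def bilinear_sample(channel, x, y):
    """Sample a 2D grid (list of rows) at a continuous location.

    Assumed convention: x indexes columns and y indexes rows, both in
    pixel units; locations outside the grid are clamped to the border.
    """
    height, width = len(channel), len(channel[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * channel[y0][x0] + dx * channel[y0][x1]
    bottom = (1 - dx) * channel[y1][x0] + dx * channel[y1][x1]
    return (1 - dy) * top + dy * bottom

def point_readout(features, x, y, channel_weights, bias=0.0):
    """Hypothetical single-voxel readout: sample every channel at one
    location, then combine the samples with per-channel weights."""
    return bias + sum(w * bilinear_sample(ch, x, y)
                      for ch, w in zip(features, channel_weights))

# Toy 2-channel, 2x2 feature map (channels x rows x cols).
features = [[[0.0, 1.0], [2.0, 3.0]],
            [[1.0, 1.0], [1.0, 1.0]]]

center = bilinear_sample(features[0], 0.5, 0.5)
corner = bilinear_sample(features[0], 1.0, 1.0)
response = point_readout(features, 0.5, 0.5, [2.0, -1.0])
```

In a deep learning framework the same operation is normally provided as a built-in sampling function, and the location would be a trainable parameter per voxel; the pure-Python version here only shows the arithmetic.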
The presentation of this paper could be *significantly* improved. I think the presentation quality of this paper does not match the quality of other ICLR papers I am currently reviewing or have reviewed in past years, or ICLR papers that have been accepted in prior years. The figures are unclear and lack consistent formatting, notation is often unexplained, and there is significant wasted space. My specific concerns are below:
1. Figure 1 -> This figure is very cluttered and very confusing. Why are the su
- Compare response optimized and task optimized models directly
- Compared many different model-brain mapping functions
- Present a new metric for model-brain mapping
- The models tested varied along many factors, making it difficult to draw strong conclusions about the role of response vs. task-optimization or vision vs. language in models' performance. For response vs. task, these points could have been made more compelling by training the same architecture on both task and neural responses
- The biggest issue is that the major findings of this paper have been shown previously (also on the same dataset). Prior work with vision and language models (e.g., Doer
Taxonomy
Topics: Infrared Target Detection Methodologies
