LaVCa: LLM-assisted Visual Cortex Captioning
Takuya Matsuyama, Shinji Nishimoto, Yu Takagi

TL;DR
This paper introduces LaVCa, a novel LLM-assisted method that generates natural language captions for images to better interpret and analyze voxel responses in the human visual cortex, revealing detailed functional differentiation.
Contribution
LaVCa is the first approach to use large language models for generating captions that explain voxel selectivity in the visual cortex, improving interpretability of neural responses.
Findings
LaVCa produces more accurate captions of voxel selectivity than previous methods.
Captions generated by LaVCa capture detailed properties at inter- and intra-voxel levels.
LaVCa reveals fine-grained functional differentiation within visual cortex regions.
Abstract
Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa to image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by…
Peer Reviews
Decision · ICLR 2026 Poster
The problem is a very interesting one, and understanding the selectivity distribution across the visual cortex can be of medical and scientific interest. The approach is simple and consists of modular components that could be swapped out with future improvement in model quality.
1. It doesn't really make sense to me why the authors use an encoder to rank images in the first step. Each voxel has ground-truth most-activating images (from the fMRI beta weights). So this step seems totally unnecessary and would degrade the model by replacing the real data ranking with a predicted ranking.
2. The keyword extraction stage (Figure 3d) seems unnecessary. It is not clear that voxels would respond only to entities in an image, as opposed to actions or adjectives.
3. The in-tex
The strengths of this paper lie in its creative use of LLMs to generate natural-language captions that accurately describe voxel-level visual selectivity in the human cortex, surpassing prior methods in both interpretability and diversity of representations. The approach is robust across benchmarks, clearly demonstrates broader and finer-grained conceptual tuning in visual areas, and is built with modularity and reproducibility in mind, thereby enhancing its impact on both neuroscience and neu
- While the captions are more diverse, the method often omits very local, fine-grained details in face-selective or highly specialized voxels, likely a result of the summary steps and current limitations of captioning models.
- Some hyperparameters (e.g., keywords and image set size) influence accuracy, and there is limited exploration of more structured or hierarchical compositional strategies for capturing multi-concept or multimodal selectivity.
- Benchmark comparisons focus primarily on accur
The method leverages state-of-the-art AI methods to improve not only prediction but also interpretability of fMRI data. The model generalizes fMRI responses to held-out images to generate rich descriptions of each voxel. The main figures replicate prior results of known neural tuning and may be a source of hypothesis generation for future studies. The paper is well written with clear visuals.
The main weakness is that it is unclear what advantage LaVCa has over using the direct fMRI responses from NSD. Given the large set of images shown in that fMRI dataset, which were drawn from MS-COCO, it seems possible to find the NSD image (or set of images) that drives the highest response in each voxel and do the same LLM extraction / sentence composition on the captions for those images. (This could be done in a cross-validated / encoding framework if desired.) In most of the pa
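Two of the reviews above contrast ranking images by measured voxel responses with ranking them by the predictions of an encoding model. The sketch below illustrates only that contrast on synthetic data. It is not the authors' pipeline: the ridge regression is a stand-in for a DNN-based encoding model, and `top_k_images`, the array shapes, the noise level, and `k` are all hypothetical choices made for illustration.

```python
import numpy as np


def top_k_images(responses: np.ndarray, k: int) -> np.ndarray:
    """Return, for each voxel, the indices of the k images with the largest response.

    responses has shape (n_images, n_voxels); the result has shape (n_voxels, k),
    ordered from the strongest response downward.
    """
    order = np.argsort(-responses, axis=0)  # descending order within each voxel column
    return order[:k].T


# Synthetic stand-ins (assumptions): image features and noisy "measured" voxel responses.
rng = np.random.default_rng(0)
n_images, n_voxels, n_features = 200, 5, 16
features = rng.standard_normal((n_images, n_features))
true_weights = rng.standard_normal((n_features, n_voxels))
measured = features @ true_weights + 0.5 * rng.standard_normal((n_images, n_voxels))

# Ridge regression as a simple stand-in for an encoding model (closed-form solution).
ridge_lambda = 1.0
fitted_weights = np.linalg.solve(
    features.T @ features + ridge_lambda * np.eye(n_features),
    features.T @ measured,
)
predicted = features @ fitted_weights

# Rank images per voxel both ways and count how many top-k images the rankings share.
k = 10
top_measured = top_k_images(measured, k)
top_predicted = top_k_images(predicted, k)
overlap = [len(set(m) & set(p)) for m, p in zip(top_measured, top_predicted)]
print("shared top-k images per voxel:", overlap)
```

One difference this makes visible: a measured ranking can only cover images that were actually shown during scanning, whereas a predicted ranking can be computed for any image set once the encoding model has been fitted.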
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
