A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information
Shuxiao Ma, Linyuan Wang, Bin Yan

TL;DR
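This paper introduces a multimodal visual encoding model that adds verbal semantic information, in the form of text associated with each stimulus image, and aligns image and text features with a Transformer. It improves brain voxel prediction accuracy over previous models by about 15.87% in the left hemisphere and 4.6% in the right.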
Contribution
The study presents a novel multimodal encoding network combining image and textual features, aligning them with Transformers, and mapping to brain voxel space, which enhances visual encoding performance.
Findings
Voxel prediction accuracy improved by approximately 15.87% in the left hemisphere relative to previous models.
Voxel prediction accuracy improved by about 4.6% in the right hemisphere.
Ablation experiments indicate that the proposed model better simulates the brain's visual information processing.
Abstract
Biological research has revealed that the verbal semantic information in the brain cortex, as an additional source, participates in nonverbal semantic tasks, such as visual encoding. However, previous visual encoding models did not incorporate verbal semantic information, contradicting this biological finding. This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information in response to this issue. Our visual information encoding network model takes stimulus images as input and leverages textual information generated by a text-image generation model as verbal semantic information. This approach injects new information into the visual encoding model. Subsequently, a Transformer network aligns image and text feature information, creating a multimodal feature space. A convolutional network then maps from this…
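The pipeline outlined in the abstract (image and text features, Transformer alignment, convolutional mapping toward voxel space) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The single-head cross-attention, the 1-D convolution over tokens, the mean pooling, every dimension, and the random stand-in features are all assumptions made for illustration; the summary above does not specify any of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens query the text tokens."""
    q = img_tokens @ w_q                              # (n_img, d)
    k = txt_tokens @ w_k                              # (n_txt, d)
    v = txt_tokens @ w_v                              # (n_txt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (n_img, n_txt)
    return img_tokens + attn @ v, attn                # residual fusion

def conv1d_tokens(x, kernel):
    """'Valid' 1-D convolution over the token axis.
    x: (n, d); kernel: (k, d, c) -> output (n - k + 1, c)."""
    k = kernel.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.einsum("nkd,kdc->nc", windows, kernel)

def encode(img_tokens, txt_tokens, params):
    """Fuse image and text tokens, then map to a voxel response vector."""
    fused, attn = cross_attention(
        img_tokens, txt_tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    feat = np.maximum(conv1d_tokens(fused, params["kernel"]), 0.0)  # ReLU
    return feat.mean(axis=0) @ params["w_out"], attn  # (n_voxels,), attention map

# Hypothetical sizes; real features would come from an image backbone and
# from text generated for each stimulus image.
rng = np.random.default_rng(0)
n_img, n_txt, d, c, n_voxels = 16, 8, 32, 24, 100
params = {
    "w_q": rng.normal(scale=0.1, size=(d, d)),
    "w_k": rng.normal(scale=0.1, size=(d, d)),
    "w_v": rng.normal(scale=0.1, size=(d, d)),
    "kernel": rng.normal(scale=0.1, size=(3, d, c)),
    "w_out": rng.normal(scale=0.1, size=(c, n_voxels)),
}
img_tokens = rng.normal(size=(n_img, d))  # stand-in image features
txt_tokens = rng.normal(size=(n_txt, d))  # stand-in text features
pred, attn = encode(img_tokens, txt_tokens, params)
```

Here `pred` holds one predicted response per voxel for a single stimulus, and each row of `attn` shows how strongly one image token attends to each text token. In a trained model the weights would be fitted to recorded brain responses rather than drawn at random.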
Taxonomy
Topics: Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques · Educational Technology and Pedagogy
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Label Smoothing · Linear Layer · Layer Normalization · Softmax · Byte Pair Encoding · Dropout · Adam
