A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information

Shuxiao Ma; Linyuan Wang; Bin Yan

arXiv:2308.15142 · cs.CV · August 30, 2023

TL;DR

This paper introduces a multimodal visual encoding model that integrates verbal semantic information via textual data and Transformer alignment, significantly improving brain voxel prediction accuracy over previous models.

Contribution

The study presents a multimodal encoding network that combines image and textual features, aligns them with a Transformer, and maps the fused features to brain voxel space, improving visual encoding performance.

Findings

01. Voxel prediction accuracy improved by approximately 15.87% in the left hemisphere.

02. Voxel prediction accuracy improved by about 4.6% in the right hemisphere.

03. Ablation experiments indicate that the proposed model simulates the brain's visual information processing better than previous models.
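
For readers who want to see how per-hemisphere percentages of this kind are typically derived, here is a minimal sketch. It rests on assumptions: this summary does not state which accuracy metric the paper uses or whether the reported gains are relative or absolute. The sketch assumes a voxel-wise Pearson correlation between predicted and measured responses, and a relative improvement over a baseline score. The function names `pearson` and `relative_improvement` and the sample values are hypothetical.

```python
import math

def pearson(predicted, measured):
    """Pearson correlation between predicted and measured responses of one voxel."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    std_m = math.sqrt(sum((m - mean_m) ** 2 for m in measured))
    # A constant signal has no defined correlation; report 0.0 in that case.
    if std_p == 0.0 or std_m == 0.0:
        return 0.0
    return cov / (std_p * std_m)

def relative_improvement(baseline_score, new_score):
    """Percentage gain of new_score over baseline_score (assumed to be relative)."""
    return 100.0 * (new_score - baseline_score) / baseline_score

# Hypothetical values for illustration only; these are not the paper's data.
score = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
gain = relative_improvement(0.40, 0.46)
print(round(score, 6), round(gain, 2))
```

In practice the score would be averaged over all voxels in a hemisphere before the two models are compared.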

Abstract

Biological research has revealed that the verbal semantic information in the brain cortex, as an additional source, participates in nonverbal semantic tasks, such as visual encoding. However, previous visual encoding models did not incorporate verbal semantic information, contradicting this biological finding. This paper proposes a multimodal visual information encoding network model based on stimulus images and associated textual information in response to this issue. Our visual information encoding network model takes stimulus images as input and leverages textual information generated by a text-image generation model as verbal semantic information. This approach injects new information into the visual encoding model. Subsequently, a Transformer network aligns image and text feature information, creating a multimodal feature space. A convolutional network then maps from this…
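
The pipeline outlined in the abstract (image features and text features aligned by a Transformer, then a convolutional mapping toward voxel space) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single cross-attention step in which image tokens attend to text tokens, followed by a depthwise 1D convolution, mean pooling, and a linear readout as a stand-in for the paper's convolutional mapping. All names, shapes, and random weights (`cross_attention`, `conv_readout`, `n_voxels`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Image tokens (queries) attend over text tokens (keys and values)."""
    q = img_tokens @ w_q                                # (n_img, d)
    k = txt_tokens @ w_k                                # (n_txt, d)
    v = txt_tokens @ w_v                                # (n_txt, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_img, n_txt)
    fused = img_tokens + weights @ v                    # residual fusion
    return fused, weights

def conv_readout(fused, kernel, w_out):
    """Depthwise 1D convolution over tokens, ReLU, mean pooling, linear map to voxels."""
    k = kernel.shape[0]
    windows = [(fused[i:i + k] * kernel).sum(axis=0)
               for i in range(fused.shape[0] - k + 1)]
    conv = np.maximum(np.stack(windows), 0.0)           # (n_img - k + 1, d)
    return conv.mean(axis=0) @ w_out                    # (n_voxels,)

# Random stand-ins for pretrained image and text features (assumed shapes).
rng = np.random.default_rng(0)
n_img, n_txt, d, n_voxels = 16, 8, 32, 100
img_tokens = rng.normal(size=(n_img, d))
txt_tokens = rng.normal(size=(n_txt, d))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
kernel = rng.normal(scale=0.1, size=(3, d))
w_out = rng.normal(scale=0.1, size=(d, n_voxels))

fused, attn = cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v)
voxel_pred = conv_readout(fused, kernel, w_out)
print(fused.shape, attn.shape, voxel_pred.shape)
```

In a trained model the projection matrices, convolution kernel, and readout weights would be learned by regressing the predicted responses onto measured fMRI voxel responses. Here they are random, so the output values themselves carry no meaning.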

Taxonomy

Topics: Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques · Educational Technology and Pedagogy

Methods: Attention Is All You Need · Dense Connections · Residual Connection · Label Smoothing · Linear Layer · Layer Normalization · Softmax · Byte Pair Encoding · Dropout · Adam