Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction

Guobin Shen; Dongcheng Zhao; Xiang He; Linghao Feng; Yiting Dong; Jihang Wang; Qian Zhang; Yi Zeng

arXiv:2404.19438 · cs.NE · October 15, 2024



Open Access · 1 Video

TL;DR

This paper introduces a novel framework combining 3D brain structure analysis with visual semantics using Vision Transformer 3D, enabling improved, interpretable decoding of brain signals for visual reconstruction and language tasks without subject-specific models.

Contribution

The work presents a unified multi-level visual feature extractor integrated with LLMs, eliminating the need for customized models and enabling single-trial decoding from non-invasive brain recordings.

Findings

01. Superior performance in brain captioning and visual reconstruction tasks

02. Enhanced interpretability of neural signals through concept localization

03. Effective integration with LLMs for complex reasoning

Abstract

Decoding non-invasive brain recordings is pivotal for advancing our understanding of human cognition but faces challenges due to individual differences and complex neural signal representations. Traditional methods often require customized models and extensive trials, lacking interpretability in visual reconstruction tasks. Our framework integrates 3D brain structures with visual semantics using a Vision Transformer 3D. This unified feature extractor efficiently aligns fMRI features with multiple levels of visual embeddings, eliminating the need for subject-specific models and allowing extraction from single-trial data. The extractor consolidates multi-level visual features into one network, simplifying integration with Large Language Models (LLMs). Additionally, we have enhanced the fMRI dataset with diverse fMRI-image-related textual data to support multimodal large model development.…
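The abstract's two core ideas, tokenizing a 3D brain volume for a transformer and aligning the resulting features with visual embeddings at several levels, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the cubic patch size, the cosine-distance loss, and the equal weighting of levels are all assumptions made for illustration.

```python
import numpy as np


def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches.

    Returns an array of shape (num_patches, patch**3), one flattened
    token per patch. The cubic patch shape is an assumption.
    """
    d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each dimension must be divisible by the patch size")
    gd, gh, gw = d // patch, h // patch, w // patch
    # Separate each axis into (grid index, offset inside the patch).
    x = volume.reshape(gd, patch, gh, patch, gw, patch)
    # Bring the three grid axes to the front, in-patch offsets to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(gd * gh * gw, patch ** 3)


def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean cosine distance between row vectors (hypothetical loss choice)."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))


def multi_level_loss(preds, targets):
    """Sum the alignment loss over feature levels (equal weights assumed)."""
    return sum(cosine_alignment_loss(p, t) for p, t in zip(preds, targets))


# Toy usage: a 4x4x4 "volume" split into 2x2x2 patches gives 8 tokens of 8 values.
volume = np.arange(64, dtype=float).reshape(4, 4, 4)
tokens = patchify_3d(volume, 2)
print(tokens.shape)
```

In a real model the tokens would pass through a transformer encoder before any loss is computed; the sketch only shows the shape bookkeeping and the idea of combining several alignment targets into one objective.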


Videos

Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction · SlidesLive

Taxonomy

Topics: Cognitive Science and Education Research

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Linear Layer · Dense Connections