Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction
Guobin Shen, Dongcheng Zhao, Xiang He, Linghao Feng, Yiting Dong, Jihang Wang, Qian Zhang, Yi Zeng

TL;DR
This paper introduces a novel framework combining 3D brain structure analysis with visual semantics using Vision Transformer 3D, enabling improved, interpretable decoding of brain signals for visual reconstruction and language tasks without subject-specific models.
Contribution
The work presents a unified multi-level visual feature extractor integrated with LLMs, eliminating the need for customized models and enabling single-trial decoding from non-invasive brain recordings.
Findings
Superior performance in brain captioning and visual reconstruction tasks
Enhanced interpretability of neural signals through concept localization
Effective integration with LLMs for complex reasoning
Abstract
Decoding non-invasive brain recordings is pivotal for advancing our understanding of human cognition but faces challenges due to individual differences and complex neural signal representations. Traditional methods often require customized models and extensive trials, lacking interpretability in visual reconstruction tasks. Our framework integrates 3D brain structures with visual semantics using a Vision Transformer 3D. This unified feature extractor efficiently aligns fMRI features with multiple levels of visual embeddings, eliminating the need for subject-specific models and allowing extraction from single-trial data. The extractor consolidates multi-level visual features into one network, simplifying integration with Large Language Models (LLMs). Additionally, we have enhanced the fMRI dataset with diverse fMRI-image-related textual data to support multimodal large model development.…
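As a rough illustration of the pipeline the abstract describes, the sketch below splits a 3D volume into cubic patch tokens (the input step of a 3D Vision Transformer) and maps a pooled representation to two embedding levels with an alignment loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the volume size, patch size, embedding widths, mean pooling in place of the transformer encoder, random linear heads, and the cosine loss are all hypothetical choices made for illustration.

```python
import numpy as np


def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches.

    Returns an array of shape (num_patches, patch**3), one row per token.
    """
    d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each volume dimension must be divisible by patch")
    gd, gh, gw = d // patch, h // patch, w // patch
    x = volume.reshape(gd, patch, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # grid axes first, then in-patch axes
    return x.reshape(gd * gh * gw, patch ** 3)


def cosine_alignment_loss(pred, target):
    """Return 1 minus the mean cosine similarity between matching rows."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(pred * target, axis=-1)))


rng = np.random.default_rng(0)

# Hypothetical toy "fMRI" volume; real volumes are far larger.
volume = rng.standard_normal((8, 8, 8))
tokens = patchify_3d(volume, patch=4)  # shape (8, 64)

# Assumption: mean pooling stands in for the transformer encoder.
pooled = tokens.mean(axis=0, keepdims=True)  # shape (1, 64)

# Assumption: two random linear heads stand in for the multi-level outputs.
head_high = rng.standard_normal((64, 16))  # e.g. a semantic-level embedding
head_low = rng.standard_normal((64, 32))   # e.g. a low-level visual embedding
pred_high = pooled @ head_high
pred_low = pooled @ head_low

# Hypothetical targets; in practice these would come from image encoders.
target_high = rng.standard_normal((1, 16))
target_low = rng.standard_normal((1, 32))

total_loss = (cosine_alignment_loss(pred_high, target_high)
              + cosine_alignment_loss(pred_low, target_low))
```

Sharing one set of tokens across several output heads is what lets a single network serve multiple embedding levels in this sketch; a trained encoder would replace the pooling step and the random weights.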
Taxonomy
Topics: Cognitive Science and Education Research
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Linear Layer · Dense Connections
