Learning Brain Representation with Hierarchical Visual Embeddings

Jiawen Zheng; Haonan Jia; Ming Li; Yuhui Zheng; Yufeng Zeng; Yang Gao; Chen Liang

arXiv:2602.07495·cs.CV·February 10, 2026

TL;DR

This paper introduces a novel method for decoding visual information from brain signals by aligning hierarchical visual embeddings with brain activity, utilizing multiple pre-trained encoders and a fusion prior to improve accuracy and fidelity.

Contribution

It proposes a multi-scale, hierarchical brain-image alignment strategy with a fusion prior, advancing the understanding of visual representations in the human brain.

Findings

01 · Improved retrieval accuracy of visual stimuli from brain signals.

02 · Enhanced reconstruction fidelity of visual images from neural data.

03 · Effective alignment of multi-scale visual features with brain activity.

Abstract

Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and…
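As a rough illustration of the alignment idea in the abstract, the sketch below computes a symmetric contrastive (InfoNCE-style) loss between brain embeddings and fused image embeddings. It is not the authors' implementation: the function names, the concatenation-based fusion, the temperature of 0.07, and the embedding sizes are all assumptions made for this example, and the abstract excerpt does not specify how the encoders are combined or how the Fusion Prior works.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def fuse_embeddings(per_encoder_embs):
    """Hypothetical fusion: normalize each encoder's output, then concatenate.

    This is an assumption for illustration only; the paper's actual fusion
    scheme is not described in the excerpt above.
    """
    return np.concatenate([l2_normalize(e) for e in per_encoder_embs], axis=1)

def symmetric_contrastive_loss(brain_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a matched pair."""
    b = l2_normalize(brain_emb)
    v = l2_normalize(image_emb)
    logits = b @ v.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])          # positives sit on the diagonal
    loss_b2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # brain -> image
    loss_v2b = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> brain
    return 0.5 * (loss_b2v + loss_v2b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for a semantic encoder and a low-level encoder (random data).
    semantic = rng.normal(size=(8, 16))
    low_level = rng.normal(size=(8, 32))
    image_emb = fuse_embeddings([semantic, low_level])
    # Pretend the brain encoder already maps signals close to the image space.
    brain_emb = image_emb + 0.01 * rng.normal(size=image_emb.shape)
    print("matched pairs:   ", symmetric_contrastive_loss(brain_emb, image_emb))
    print("mismatched pairs:", symmetric_contrastive_loss(np.roll(brain_emb, 1, axis=0), image_emb))
```

Running the script prints a much smaller loss for matched pairs than for mismatched ones, which is the property a retrieval-oriented alignment objective relies on. In a trainable setting the same loss would be written in an autodiff framework so that gradients reach the brain encoder.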

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The paper tackles a timely and relevant problem—decoding visual information from brain signals—using an interesting combination of methods. It combines multiple pretrained visual encoders to capture hierarchical visual representations and introduces a Fusion Prior to improve stability and cross-modal consistency. While the individual components build on existing ideas, their integration is potentially novel and thoughtfully motivated. The experiments indicate improvements over other methods

Weaknesses

While the paper addresses an important problem, its main novelty lies in combining known components rather than introducing a fundamentally new mechanism. The Fusion Prior is interesting but I didn't find the motivation very clear, and its contribution relative to the pretrained encoders is not clearly disentangled. The approach relies heavily on VAEs for low-level reconstruction, but alternative reconstruction strategies—such as diffusion-based or adversarial priors directly aligned with EEG fe

Reviewer 02 · Rating 6 · Confidence 2

Strengths

The motivation behind the approach is clear, and the results are encouraging.

Weaknesses

Please see questions.

Reviewer 03 · Rating 4 · Confidence 2

Strengths

- The authors contribute a framework for both image retrieval and image reconstruction from brain signals. The idea of using multiple image encoders, fusing their representations, and applying a symmetrical self-supervised loss is sound and interesting. The results of their approach also show a significant increase in performance over the previous state of the art, and across several modalities (EEG and MEG). While I have some questions regarding the novelty of the components of the framework (see

Weaknesses

- While the model outperforms existing baselines, this improvement might be partly expected due to the increased image information provided by combining multiple encoders. Prior studies (e.g., [1], [2]) have already shown that the choice of image encoder significantly influences decoding performance. Can the authors comment on this?
- The authors claim that the superior performance of their model is due to involving different levels of image information (high-level from CLIP and low-level from

Taxonomy

Topics: Face Recognition and Perception · EEG and Brain-Computer Interfaces · Visual Attention and Saliency Detection