Learning Brain Representation with Hierarchical Visual Embeddings
Jiawen Zheng, Haonan Jia, Ming Li, Yuhui Zheng, Yufeng Zeng, Yang Gao, Chen Liang

TL;DR
This paper introduces a novel method for decoding visual information from brain signals by aligning hierarchical visual embeddings with brain activity, utilizing multiple pre-trained encoders and a fusion prior to improve accuracy and fidelity.
Contribution
It proposes a multi-scale, hierarchical brain-image alignment strategy with a fusion prior, advancing the understanding of visual representations in the human brain.
Findings
Improved retrieval accuracy of visual stimuli from brain signals.
Enhanced reconstruction fidelity of visual images from neural data.
Effective alignment of multi-scale visual features with brain activity.
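Retrieval accuracy of the kind listed above is commonly reported as top-k accuracy: for each brain embedding, the candidate images are ranked by similarity and the trial counts as a hit if the true image is among the first k. The sketch below illustrates that metric only. It assumes cosine similarity, non-zero vectors, and that row i of each list refers to the same stimulus; the function names are hypothetical and the paper's actual evaluation protocol is not given on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_accuracy(brain_embs, image_embs, k=1):
    """Fraction of brain embeddings whose paired image ranks in the top k.

    Assumption: brain_embs[i] and image_embs[i] belong to the same stimulus.
    """
    hits = 0
    for i, brain in enumerate(brain_embs):
        sims = [cosine(brain, img) for img in image_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(brain_embs)


if __name__ == "__main__":
    # Toy data: the second brain embedding points at the wrong image.
    brain = [[1.0, 0.0], [1.0, 0.0]]
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(top_k_accuracy(brain, images, k=1))
    print(top_k_accuracy(brain, images, k=2))
```

With only two candidates, k=2 trivially counts every trial as a hit, which is why top-k numbers are only meaningful relative to the size of the candidate set.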
Abstract
Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and…
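The abstract mentions a contrastive objective for aligning brain signals with visual embeddings but does not give its form here. One common choice, used as an assumption in this sketch, is a CLIP-style symmetric InfoNCE loss: matched brain-image pairs in a batch are pulled together and mismatched pairs pushed apart, averaged over both retrieval directions. The function names and the temperature value of 0.07 are illustrative and not taken from the paper.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(brain_embs, image_embs, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a batch of paired embeddings.

    Assumption: brain_embs[i] and image_embs[i] form the positive pair.
    """
    b = [_normalize(v) for v in brain_embs]
    g = [_normalize(v) for v in image_embs]
    # Scaled cosine similarities: rows index brain samples, columns index images.
    logits = [[sum(x * y for x, y in zip(bi, gj)) / temperature for gj in g]
              for bi in b]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-brain direction
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)  # correctly paired batches give the lower loss
```

In a multi-encoder setting such as the one described, a loss of this kind could be applied once per visual encoder or once on a fused embedding; which of these the authors do is not stated on this page.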
Peer Reviews
Decision: ICLR 2026 Poster
The paper tackles a timely and relevant problem—decoding visual information from brain signals—using an interesting combination of methods. It combines multiple pretrained visual encoders to capture hierarchical visual representations and introduces a Fusion Prior to improve stability and cross-modal consistency. While the individual components build on existing ideas, their integration is potentially novel and thoughtfully motivated. The experiments indicate improvements over other methods
While the paper addresses an important problem, its main novelty lies in combining known components rather than introducing a fundamentally new mechanism. The Fusion Prior is interesting but I didn't find the motivation very clear, and its contribution relative to the pretrained encoders is not clearly disentangled. The approach relies heavily on VAEs for low-level reconstruction, but alternative reconstruction strategies—such as diffusion-based or adversarial priors directly aligned with EEG fe
The motivation behind the approach is clear, and the results are encouraging.
Please see questions.
- The authors contribute a framework for both image retrieval and image reconstruction from brain signals. The idea of using multiple image encoders, fusing their representations, and applying a symmetrical self-supervised loss is sound and interesting. The results also show a significant performance gain over the previous state of the art across several modalities (EEG and MEG). While I have some questions regarding the novelty of the components of the framework (see
- While the model outperforms existing baselines, this improvement might be partly expected due to the increased image information provided by combining multiple encoders. Prior studies (e.g., [1], [2]) have already shown that the choice of image encoder significantly influences decoding performance. Can the authors comment on this?
- The authors claim that the superior performance of their model is due to involving different levels of image information (high-level from CLIP and low-level from
Taxonomy
Topics: Face Recognition and Perception · EEG and Brain-Computer Interfaces · Visual Attention and Saliency Detection
