A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li

TL;DR
This paper introduces VCFlow, a hierarchical brain decoding framework inspired by the visual system, enabling rapid, subject-agnostic reconstruction of visual experiences from fMRI with minimal accuracy loss.
Contribution
The paper presents VCFlow, a novel hierarchical architecture with contrastive learning for fast, subject-agnostic visual decoding from fMRI data, reducing training time and computational costs.
Findings
VCFlow achieves 93% of the accuracy of subject-specific models.
Reconstructs videos in 10 seconds without retraining.
Requires only 7% of the data needed by traditional methods.
Abstract
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations,…
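The abstract mentions a feature-level contrastive learning strategy but, as excerpted here, does not give its exact form. As a rough illustration only, the sketch below uses a standard InfoNCE-style loss, which is an assumption and not necessarily the loss used in VCFlow. The function name `info_nce`, the `temperature` value, and the pairing scheme (row i of both arrays belongs to the same stimulus, for example features from two different subjects) are all hypothetical.

```python
import numpy as np

def info_nce(feats_a, feats_b, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative, not the paper's code).

    feats_a, feats_b: arrays of shape (batch, dim). Row i of each array is
    assumed to come from the same stimulus (a positive pair); every other
    row in the batch acts as a negative.
    """
    # L2-normalise so that dot products become cosine similarities.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = a @ b.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_prob)))

# Toy usage: matched rows should score a lower loss than mismatched rows.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_matched = info_nce(x, x)
loss_mismatched = info_nce(x, np.roll(x, 1, axis=0))
```

Minimising a loss of this kind pulls together features that correspond to the same stimulus and pushes apart features from different stimuli, which is one common way to encourage representations that depend on the stimulus and not on the individual subject.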
Peer Reviews
Decision: ICLR 2026 Poster
+ This paper provides a good overview of the work on fMRI-to-video reconstruction.
+ This paper focuses on an important issue.
1. The abstract and the first two paragraphs of the introduction strongly emphasize the issue of transfer/generalization to unseen subjects. However, in the third paragraph of the introduction (lines 73–94), the methods described by the authors have no relation to this issue. This creates a jarring shift in the writing. Similarly, after reading the Introduction, I think the authors should focus on how to achieve transfer to unseen subjects. Yet, in the methods section, only one design is actuall
- The paper is well-written and mostly easy to follow (i.e., up to the complexity of the presented method).
- The motivation for a subject-agnostic model is clearly explained and detailed.
- Most fMRI-to-video decoding methods are subject-specific, which is a significant bottleneck for clinical applications; this method addresses this fundamental challenge, making it an important first step towards better practicality.
- The method seems quite complex, with many components, while the ablation seems incomplete and doesn't establish that all the elements are crucial to the final performance.
- Missing information on how they generate the final videos in less than 10 seconds (claim in abstract). Could they be more specific about which generative model they use and exactly which brain-predicted embeddings they give to it (as only Stable Diffusion is mentioned in Figure 3 but without further details, maybe explained in
1. The motivation of the article is very good. I completely agree that brain decoding models should focus further on decoding for multiple individuals, especially unseen subjects, because we cannot expect users to undergo prolonged data collection in practical use. Generalization across subjects is a problem that must be addressed.
2. The article is well-written and clearly expressed. The experiments were also conducted thoroughly.
3. The model not only designs multiple different
1. I agree that generalization to unseen subjects is very important. However, I do not fully agree that the reliability of the method can be demonstrated in the video reconstruction task. As is well known, current brain video reconstruction largely depends on the pre-trained Diffusion model to achieve its effects. Then, we calculate the indicators based on the generated videos. I believe the results are heavily influenced by the generation model rather than actual brain decoding. (In other words
Taxonomy
Topics: Face Recognition and Perception · Generative Adversarial Networks and Image Synthesis · EEG and Brain-Computer Interfaces
