TL;DR
MindHier is a hierarchical autoregressive framework for fMRI-to-image reconstruction that improves semantic fidelity, speeds up inference, and aligns better with human visual perception compared to diffusion-based methods.
Contribution
It introduces a hierarchical autoregressive approach with multi-level neural embeddings and scale-wise guidance, moving beyond diffusion models that rely on a single fixed guidance embedding.
Findings
Achieves superior semantic fidelity in image reconstruction.
Runs 4.67 times faster than diffusion-based baselines.
Produces more deterministic and consistent results.
Abstract
Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These…
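The scale-aware guidance idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the even split of scales across embedding levels, and the ordering (most semantic embedding at the coarsest scales) are all assumptions made for illustration, and `step_fn` is a hypothetical stand-in for one autoregressive step of a scale-wise model.

```python
from typing import Callable, List, Sequence, Tuple


def assign_levels_to_scales(num_scales: int, num_levels: int) -> List[int]:
    """Map each generation scale (0 = coarsest) to an embedding level.

    Assumption: level 0 is the most semantic embedding, and scales are
    split as evenly as possible across levels, in order.
    """
    if num_scales < 1 or num_levels < 1:
        raise ValueError("num_scales and num_levels must be positive")
    return [min(s * num_levels // num_scales, num_levels - 1)
            for s in range(num_scales)]


def coarse_to_fine_generate(
    embeddings: Sequence[object],
    scale_sides: Sequence[int],
    step_fn: Callable[[list, int, object], list],
) -> Tuple[list, List[int]]:
    """Generate token maps scale by scale, switching guidance per scale.

    step_fn(context, side, guidance) is a hypothetical placeholder for a
    model call that predicts the side x side token map of the next scale,
    conditioned on all coarser token maps and one guidance embedding.
    """
    schedule = assign_levels_to_scales(len(scale_sides), len(embeddings))
    context: list = []
    for scale_idx, side in enumerate(scale_sides):
        guidance = embeddings[schedule[scale_idx]]
        context.append(step_fn(context, side, guidance))
    return context, schedule


def dummy_step(context: list, side: int, guidance: object) -> list:
    """Toy step: fill the token map with the guidance label it received."""
    return [guidance] * (side * side)


token_maps, schedule = coarse_to_fine_generate(
    ["semantic", "mid", "detail"], [1, 2, 4], dummy_step
)
```

With a single embedding level, `assign_levels_to_scales` returns all zeros, which corresponds to the fixed-guidance setting the abstract contrasts against: every scale receives the same embedding.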
Peer Reviews
Decision: ICLR 2026 Poster
+ The motivation of this paper is clear, and the proposed method effectively addresses it.
+ The manuscript of this paper is well-structured.
+ The experimental results achieved in this paper are quite good.
+ I believe that the method in this paper does not fully simulate the hierarchical processing mechanism of the human visual system during the fMRI representation learning stage. Specifically, different brain regions play distinct roles at various stages of the hierarchical processing mechanism. Therefore, I think dividing the fMRI signals into multiple brain regions, extracting representations separately, and then integrating them would better align with the biological mechanism. This approach m
1. This work has a well-motivated approach. The idea of aligning hierarchical fMRI features with multi-scale image generation is innovative and biologically plausible, echoing the "forest before trees" principle in human perception.
2. The use of a scale-wise autoregressive model (VAR) is a fresh direction compared to the overused diffusion models. The VAR-based method also achieves competitive results on multiple high-level metrics at a faster speed. Besides that, the reconstruction results a
1. I noticed that text information is also used as model input. A result that uses only text information as input is needed to determine whether the model is translating the fMRI data or depending mainly on the text. Because the pre-trained VAR is a text-to-image generation model, there is concern that fMRI does not play a major role in this model.
2. In terms of speed, I understand that most of the steps in the VAR model are carri
- The paper is easy to follow and the experiments are detailed.
- The use of an autoregressive generation model for image reconstruction from fMRI has been largely unexplored, and the paper is a welcome initiative in the field.
- The reconstructions shown are impressive, and the stability of reconstructions is an attractive feature for potential future BCI applications.
- Important ablations on the hierarchical module are presented and lead to interesting conclusions (e.g. earlier featu
- A number of claims are rather strong compared to the results supporting them. For example, at L339 it is claimed that the authors' framework is 'fundamentally more efficient' than ME2, but it is not clear which results support the 'fundamental' aspect of this claim: the inference-time bottleneck of each pipeline is not detailed, and thus it is unclear whether this gain in inference time is a property of the fMRI-to-image pipelines or linked to the efficiency of the diffusion vs a
