MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion
Yizhuo Lu, Changde Du, Qiongyi Zhou, Dianpeng Wang, Huiguang He

TL;DR
MindDiffuser is a two-stage model that reconstructs images from brain activity by aligning semantic and structural features, surpassing previous methods and demonstrating neurobiological plausibility.
Contribution
The paper introduces a novel two-stage approach combining VQ-VAE, CLIP, and Stable Diffusion for controlled image reconstruction from fMRI data.
Findings
Outperforms state-of-the-art methods on the Natural Scenes Dataset (NSD)
Achieves cohesive semantic and structural alignment
Demonstrates neurobiological plausibility of the model
Abstract
Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Despite the advancements in complex image reconstruction techniques, the challenge persists in achieving a cohesive alignment of both semantic (concepts and objects) and structure (position, orientation, and size) with the image stimuli. To address the aforementioned issue, we propose a two-stage image reconstruction model called MindDiffuser. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are put into Stable Diffusion, which yields a preliminary image that contains semantic information. In Stage 2, we utilize the CLIP visual feature decoded from fMRI as supervisory…
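The abstract says that features are "decoded from fMRI" but, as excerpted here, does not say how. As a rough illustration only, the sketch below assumes a linear ridge-regression decoder that maps voxel responses to a feature vector (standing in for a CLIP embedding or a VQ-VAE latent). The function names, array sizes, regularization strength, and synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_voxels = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_voxels)
    return np.linalg.solve(A, X.T @ Y)

def decode(X, W):
    """Predict feature vectors from fMRI voxel responses."""
    return X @ W

# Synthetic stand-in data (assumption): 200 trials, 50 voxels, 8-dim features.
rng = np.random.default_rng(0)
n_trials, n_voxels, n_feat = 200, 50, 8
W_true = rng.normal(size=(n_voxels, n_feat))
fmri = rng.normal(size=(n_trials, n_voxels))
features = fmri @ W_true + 0.01 * rng.normal(size=(n_trials, n_feat))

# Fit on the first 150 trials, decode the held-out 50.
W = fit_ridge(fmri[:150], features[:150], alpha=1e-3)
pred = decode(fmri[150:], W)
```

In a pipeline like the one described, decoded vectors of this kind would then be handed to the generative model; that step is not shown here because the excerpt does not specify it.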
Taxonomy
Topics: Neural dynamics and brain function · Visual Attention and Saliency Detection · Face Recognition and Perception
Methods: Contrastive Language-Image Pre-training · Diffusion · ALIGN · VQ-VAE
