HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou

TL;DR
HAVIR is a novel hierarchical model that reconstructs complex visual stimuli from brain activity by integrating topological and semantic information through specialized adapters and diffusion models.
Contribution
It introduces a dual-adapter framework combining topological and semantic encoding for improved brain-to-image reconstruction.
Findings
Outperforms existing models in reconstructing complex visual stimuli.
Effectively captures both structural and semantic features.
Demonstrates robustness in complex scenarios.
Abstract
Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Although progress has been made in decoding images from fMRI using generative models, accurately recovering highly complex visual stimuli remains a challenge. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR, which contains two adapters: (1) the AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) the CLIP Adapter converts the voxels into CLIP text and image embeddings, which carry semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios,…
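The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the adapters are stood in for by random linear maps, and every name and size here (`N_VOXELS`, `LATENT_SHAPE`, `CLIP_DIM`, `autokl_adapter`, `clip_adapter`) is a hypothetical placeholder chosen for illustration. The final fusion step by Versatile Diffusion is deliberately left out, since the abstract does not specify it.

```python
import random

# Hypothetical sizes, kept tiny for illustration; none come from the paper.
N_VOXELS = 64
LATENT_SHAPE = (4, 8, 8)  # (channels, height, width) of the latent prior
CLIP_DIM = 16


def make_linear(n_in, n_out, rng):
    """Random weight matrix with n_out rows; a stand-in for a trained adapter."""
    scale = 1.0 / n_in ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def autokl_adapter(weights, voxels):
    """Map voxels to a latent prior reshaped to (channels, height, width)."""
    flat = apply_linear(weights, voxels)
    c, h, w = LATENT_SHAPE
    return [
        [flat[(ci * h + hi) * w:(ci * h + hi) * w + w] for hi in range(h)]
        for ci in range(c)
    ]


def clip_adapter(w_text, w_image, voxels):
    """Map voxels to a pair of (text, image) embedding vectors."""
    return apply_linear(w_text, voxels), apply_linear(w_image, voxels)


rng = random.Random(0)
voxels = [rng.gauss(0.0, 1.0) for _ in range(N_VOXELS)]

c, h, w = LATENT_SHAPE
w_latent = make_linear(N_VOXELS, c * h * w, rng)
w_text = make_linear(N_VOXELS, CLIP_DIM, rng)
w_image = make_linear(N_VOXELS, CLIP_DIM, rng)

latent = autokl_adapter(w_latent, voxels)                    # structural branch
text_emb, image_emb = clip_adapter(w_text, w_image, voxels)  # semantic branch
# In the described pipeline, these three outputs would condition the diffusion model.
```

The point of the sketch is only the branching: one voxel vector feeds two independent mappings, one producing a spatially shaped latent and one producing embedding vectors.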
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception · Multimodal Machine Learning Applications
