HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang; Dong Liang; Hairong Zheng; Yihang Zhou

arXiv:2506.06035 · cs.CV · July 8, 2025

Open Access

TL;DR

HAVIR is a hierarchical model that reconstructs complex visual stimuli from fMRI brain activity: an AutoKL Adapter captures topological structure, a CLIP Adapter captures semantic information, and Versatile Diffusion fuses the two into the final image.

Contribution

The paper introduces a dual-adapter framework that combines topological encoding (AutoKL Adapter) and semantic encoding (CLIP Adapter) for improved brain-to-image reconstruction.

Findings

01 Outperforms existing models in reconstructing complex visual stimuli.

02 Effectively captures both structural and semantic features.

03 Demonstrates robustness in complex scenarios.

Abstract

Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Although progress has been made in decoding images from fMRI using generative models, accurately recovering highly complex visual stimuli remains a challenge. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR, which contains two adapters: (1) the AutoKL Adapter, which transforms fMRI voxels into a latent diffusion prior that captures topological structures; and (2) the CLIP Adapter, which converts the voxels to CLIP text and image embeddings that carry semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios,…
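
The two-adapter data flow described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the class `ToyHavir`, the helper names, the toy dimensions, and the random linear maps standing in for trained adapters are all assumptions made for illustration. Versatile Diffusion is represented only by a stub callable, `stub_generator`, that reports what it would be conditioned on.

```python
import random
from dataclasses import dataclass

# Toy, hypothetical sizes; the real voxel count and embedding sizes differ.
N_VOXELS = 32
LATENT_DIM = 16   # stand-in for the latent diffusion prior (structure)
CLIP_DIM = 8      # stand-in for a CLIP embedding (semantics)


def make_linear(n_in, n_out, rng):
    """Random n_out x n_in weights; a placeholder for a trained adapter."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


@dataclass
class Conditioning:
    latent_prior: list  # AutoKL Adapter output (topological structure)
    clip_text: list     # CLIP Adapter output, text embedding
    clip_image: list    # CLIP Adapter output, image embedding


class ToyHavir:
    """Hypothetical container for the two adapters named in the abstract."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.autokl_adapter = make_linear(N_VOXELS, LATENT_DIM, rng)
        self.clip_text_adapter = make_linear(N_VOXELS, CLIP_DIM, rng)
        self.clip_image_adapter = make_linear(N_VOXELS, CLIP_DIM, rng)

    def encode(self, voxels):
        """Map one fMRI voxel vector to the three conditioning signals."""
        return Conditioning(
            latent_prior=apply_linear(self.autokl_adapter, voxels),
            clip_text=apply_linear(self.clip_text_adapter, voxels),
            clip_image=apply_linear(self.clip_image_adapter, voxels),
        )


def reconstruct(model, voxels, generator):
    """Run both adapters, then hand their outputs to a generator callable."""
    return generator(model.encode(voxels))


def stub_generator(cond):
    """Stand-in for Versatile Diffusion: only summarizes its inputs."""
    return {
        "latent_prior_dim": len(cond.latent_prior),
        "clip_text_dim": len(cond.clip_text),
        "clip_image_dim": len(cond.clip_image),
    }


model = ToyHavir(seed=0)
voxels = [0.5] * N_VOXELS
summary = reconstruct(model, voxels, stub_generator)
print(summary)
```

In a real system, the `generator` argument would be replaced by a diffusion model that starts from the structural prior and is guided by the CLIP embeddings; keeping it as a plain callable makes the separation between the adapters and the generative stage explicit.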

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception · Multimodal Machine Learning Applications