Vision Hopfield Memory Networks
Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

TL;DR
The paper introduces the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates hierarchical memory modules for improved interpretability and data efficiency, achieving competitive results on vision benchmarks.
Contribution
V-HMN is a novel hierarchical memory-based architecture that combines local and global Hopfield modules with iterative refinement, inspired by brain principles, to enhance interpretability and efficiency in vision models.
Findings
Achieved competitive performance on vision benchmarks.
Demonstrated improved interpretability over existing models.
Showed higher data efficiency and biological plausibility.
Abstract
Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By…
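The abstract names two ingredients: Hopfield-style associative retrieval and an iterative refinement rule. The summary does not give the paper's actual equations, so the following is only a minimal sketch, assuming the standard softmax-based modern Hopfield retrieval and a simple residual update toward the retrieved prototype. The function names (`hopfield_retrieve`, `refine`) and the parameters `beta` and `step_size` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, memory, beta=1.0):
    # Assumed retrieval rule: softmax-weighted average of stored patterns,
    # weighted by the scaled dot product between query and each pattern.
    scores = [beta * sum(q * p for q, p in zip(query, pattern)) for pattern in memory]
    weights = softmax(scores)
    retrieved = [
        sum(w * pattern[i] for w, pattern in zip(weights, memory))
        for i in range(len(query))
    ]
    return retrieved, weights


def refine(x, memory, steps=3, step_size=0.5, beta=1.0):
    # Assumed refinement rule: repeatedly move x a fraction of the way
    # toward its retrieved prototype, reducing the retrieval error.
    for _ in range(steps):
        retrieved, _ = hopfield_retrieve(x, memory, beta)
        x = [xi + step_size * (ri - xi) for xi, ri in zip(x, retrieved)]
    return x


# Toy memory bank with two stored prototypes and one noisy query.
memory = [[1.0, 0.0], [0.0, 1.0]]
noisy = [0.9, 0.2]
_, weights = hopfield_retrieve(noisy, memory, beta=8.0)
refined = refine(noisy, memory, steps=5, step_size=0.5, beta=8.0)
```

In this toy setting the retrieval weights show which stored pattern dominates the update, which is the kind of inspection the interpretability claim refers to. Under the same assumptions, the routine could be applied to a patch embedding (local) or to an image-level embedding (global); how V-HMN actually separates the two is not specified in this summary.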
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The work demonstrates the benefit of a memory-based system in perceptual learning, bridging the gap between memory/prototype-based systems and distributed systems. This shows better alignment with brain theory and, in practice, improves the data efficiency and interpretability of perceptual models.
- The work includes comprehensive ablation studies on the various functions and the roles of hyperparameters.
- Presentation is clear and concise.
- All benchmarks are small-scale (≤ 32×32 images). Claims about “foundation backbone” or “multimodal generalizability” are not validated on large datasets (ImageNet, ADE20K, etc.). Consider adding a small dataset with slightly higher image resolution.
- The idea of using memory banks to improve data efficiency is not new (e.g., [1] in few-shot image generation, which tests the limits of data efficiency). The authors should consider a more comprehensive background review in terms of the computational benefi
The paper presents a novel memory-centric architecture, a very interesting design philosophy that deserves more study in the design of interpretable neural networks. Empirically, this architecture is competitive with strong prior baselines. Specifically, it outperforms prior work on data efficiency. The paper is well motivated. It offers interpretability through explicit memory retrieval. The visualizations in Figure 2 are insightful and show that retrieved prototypes align well wit
While the empirical results look strong for data efficiency, it is unclear why this is the case. An analysis or a discussion section on why this architecture is more data efficient than prior work would be very helpful. Tables 3 and 4 are steps in the right direction to aid this understanding, but it appears that neither memory size nor the number of iterative refinement steps has an effect on the performance. The natural question that arises in that case is why is this architecture better than
The use of Hopfield retrieval circuits inside a vision foundational model is novel and very interesting. Effectively the operation is akin to local and global clustering of patterns in encoding spaces derived originally from the image. The results are impressive as well.
It would be good to discuss the biological plausibility of the proposed mechanism. Hopfield networks are found later in the pathway, within the hippocampal system, and take input from the entorhinal cortex. By that point the image has already been analyzed in the visual pathway, and via the parahippocampal and perirhinal cortex, object locations and identities have already been determined. So while this is a new and interesting vision model, I am not convinced it is reflective of what is going i
* The motivation is very well defined and strong. I think this paper proposes an approach that can deal with fundamental problems in modern AI architectures, such as equipping models with associative retrieval and predictive coding capabilities, while being neuro-inspired.
* Retrieved prototypes expose which stored patterns influence decisions, which increases interpretability, a rare feature in vision backbones.
* Achieves strong results with as little as 10–30% of labeled data.
* The paper p
* While the predictive-coding analogy is conceptually appealing, it remains lightweight; the connection could be deepened theoretically or experimentally.
* While the paper presents an appealing biologically inspired narrative, the underlying model is fairly simple, essentially a combination of local/global prototype retrieval and residual refinement. The connection to predictive coding and Hopfield dynamics is more analogical than formal, and the theoretical depth is limited. Nevertheless,
Taxonomy
Topics: Face Recognition and Perception · Generative Adversarial Networks and Image Synthesis · Advanced Memory and Neural Computing
