Building Vision Models upon Heat Conduction
Zhaozhi Wang, Yue Liu, Yunjie Tian, Yunfan Liu, Yaowei Wang, Qixiang Ye

TL;DR
This paper introduces the Heat Conduction Operator (HCO), a novel method inspired by physical heat conduction, to improve visual representation models by enabling large receptive fields with reduced computational costs.
Contribution
The study presents HCO, a plug-and-play operator based on heat diffusion principles, that enhances vision models with global receptive fields while reducing computational complexity and resource usage.
Findings
HCO achieves up to 3x throughput improvement.
HCO reduces GPU memory usage by 80%.
HCO decreases FLOPs by 35% compared to Swin-Transformer.
Abstract
Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO) built upon the physical heat conduction principle. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO enjoys a computational complexity of O(N^1.5), as it can be implemented using discrete cosine transformation (DCT) operations. HCO is plug-and-play: combining it with deep learning backbones produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, beyond the stronger performance, vHeat achieves up to 3x throughput, 80% less GPU memory…
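The DCT-based diffusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `dct_matrix` and `heat_conduct`, the fixed scalar diffusivity `k`, and the time `t` are assumptions made for illustration, and the adaptive (learned) diffusion mentioned in the abstract is not modeled. The sketch solves the classical 2D heat equation with insulated boundaries by decaying each DCT coefficient by exp(-k(ωx² + ωy²)t).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n): C @ x transforms, C.T @ X inverts."""
    freq = np.arange(n)[:, None]   # frequency index
    pos = np.arange(n)[None, :]    # spatial index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * pos + 1) * freq / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)     # constant (zero-frequency) row
    return c

def heat_conduct(u0, k=1.0, t=1.0):
    """Diffuse a 2D map u0 for time t with diffusivity k (both illustrative)."""
    h, w = u0.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    wy = np.pi * np.arange(h) / h  # vertical frequencies
    wx = np.pi * np.arange(w) / w  # horizontal frequencies
    decay = np.exp(-k * (wy[:, None] ** 2 + wx[None, :] ** 2) * t)
    spectrum = ch @ u0 @ cw.T                 # 2D DCT as two matrix products
    return ch.T @ (spectrum * decay) @ cw     # inverse 2D DCT

# A single "heat source" patch on an 8x8 grid spreads to every location.
u0 = np.zeros((8, 8))
u0[3, 4] = 1.0
out = heat_conduct(u0, k=1.0, t=2.0)
```

In this sketch the transforms are plain matrix products costing O(H²W + HW²) operations; for a square grid of N = H·W patches that is O(N^1.5), which is one way to read the complexity stated in the abstract. Because the zero-frequency coefficient is left unchanged, the total "heat" in the map is conserved while the single source spreads across the whole grid.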
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The motivation is reasonable. 2. The result is better than Swin. 3. Compared with the O(N^2) computational complexity of Transformers, this method is O(N^1.5).
1. The main purpose is to obtain the global receptive fields, thus, some MLP-based Backbones [1-6] should be added to compare. Maybe, you could discuss computational efficiency and performance in ImageNet, and you can also visualize their receptive fields. [1] Strip-MLP: Efficient Token Interaction for Vision MLP. ICCV 2023 [2] RaMLP: Vision MLP via Region-aware Mixing. IJCAI 2023 [3] Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition. TPAMI 2023. [4] Res
1. vHeat’s foundation in the physical principles of heat conduction offers a unique perspective on visual representation. By conceptualizing image patches as heat sources and modeling their interactions through thermal energy diffusion, it provides a fresh, interpretable approach to feature extraction that stands apart from traditional self-attention mechanisms. 2. Despite the lower complexity, vHeat still achieves global receptive fields, allowing it to capture long-range dependencies within ima
1. Although vHeat performs well in experimental settings, the manuscript lacks substantial evidence of its effectiveness in real-world industrial applications. Validation in practical deployments, especially compared to established self-attention models, is necessary to confirm its utility beyond controlled environments. 2. While vHeat draws an analogy between heat conduction and feature propagation, certain aspects of this analogy could benefit from further elaboration. For example, how the ther
1. A novel view of visual representation 2. Extensive experiments and good results.
1. I would like the authors to discuss the proposed method in relation to relative positional embeddings, as I believe they may share similar high-level concepts. 2. I would also like to understand the motivation behind the additional branch (Linear + SiLU) shown in Fig. 2(b). 3. I am curious about the introduction of DWConv in the block and how much it contributes to the overall performance. While the authors presented this in the supplementary materials, I still do not fully grasp the motiva
Taxonomy
Topics: Machine Learning in Materials Science
Methods: Diffusion
