DuoFormer: Leveraging Hierarchical Representations by Local and Global Attention Vision Transformer
Xiaoya Tang, Bodong Zhang, Man Minh Ho, Beatrice S. Knudsen, Tolga Tasdizen

TL;DR
DuoFormer is a hierarchical vision transformer that combines CNN-based multi-scale feature extraction with a dedicated patch tokenization step and scale-wise attention, and it reports higher medical image classification accuracy than baseline models.
Contribution
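The paper presents a hierarchical transformer model that integrates a CNN backbone with a ViT through patch tokenization and scale-wise attention. The design targets two known weaknesses of ViTs, their lack of inductive biases and their dependence on large training datasets, while adding explicit multi-scale learning.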
Findings
Outperforms baseline models in classification accuracy
Effectively captures intra-scale and inter-scale associations
Plug-and-play design for various CNN architectures
Abstract
Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, while hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures…
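To make the idea of attending across scales concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the tokenization or attention details, so the sketch assumes that every scale has already been projected to the same channel width, that tokenization is plain average pooling onto a shared grid, and that scale-wise attention is single-head self-attention over the scale axis at each grid location. All function names (`pool_to_grid`, `tokenize_scales`, `scale_attention`) and all shapes are hypothetical.

```python
import numpy as np

def pool_to_grid(fmap, grid):
    """Average-pool a (C, H, W) map to (C, grid, grid). H and W must be multiples of grid."""
    c, h, w = fmap.shape
    return fmap.reshape(c, grid, h // grid, grid, w // grid).mean(axis=(2, 4))

def tokenize_scales(fmaps, grid):
    """Assumed tokenization: align all scales on one grid, giving tokens of shape (N, S, C)."""
    pooled = np.stack([pool_to_grid(f, grid) for f in fmaps], axis=0)  # (S, C, grid, grid)
    s, c = pooled.shape[:2]
    return pooled.reshape(s, c, grid * grid).transpose(2, 0, 1)        # (N, S, C)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scale_attention(tokens, wq, wk, wv):
    """Single-head self-attention over the scale axis, applied independently per location."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv            # each (N, S, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (N, S, S)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical stand-ins for backbone outputs at three resolutions.
rng = np.random.default_rng(0)
C, D = 16, 16
fmaps = [rng.standard_normal((C, size, size)) for size in (16, 8, 4)]

tokens = tokenize_scales(fmaps, grid=4)
wq, wk, wv = (rng.standard_normal((C, D)) / np.sqrt(C) for _ in range(3))
out, weights = scale_attention(tokens, wq, wk, wv)
```

Each row of `weights` describes how strongly one scale at a given location draws on the other scales there, which is one simple way to model inter-scale associations. Intra-scale associations would need a second attention pass over the location axis, which this sketch omits.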
Taxonomy
Topics: Infrared Target Detection Methodologies · Visual Attention and Saliency Detection
