DuoFormer: Leveraging Hierarchical Representations by Local and Global Attention Vision Transformer

Xiaoya Tang; Bodong Zhang; Man Minh Ho; Beatrice S. Knudsen; Tolga Tasdizen

arXiv:2506.12982 · cs.CV · June 17, 2025

TL;DR

DuoFormer is a hierarchical vision transformer that combines CNN-based multi-scale feature extraction with a patch tokenization step and scale-wise attention, and reports higher medical image classification accuracy than baseline models.

Contribution

The paper presents a hierarchical transformer that feeds multi-scale CNN features into a ViT through patch tokenization and scale-wise attention, addressing ViTs' lack of inductive biases and their dependence on extensive training datasets.

Findings

01 Outperforms baseline models in classification accuracy

02 Effectively captures intra-scale and inter-scale associations

03 Plug-and-play design for various CNN architectures

Abstract

Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, while hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures…
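The abstract is cut off before it describes the attention mechanism in detail, so the following is only a toy sketch of one way the described pipeline could be arranged, not the authors' implementation. It assumes that (a) every scale is pooled to the same patch grid so that a token at a given grid position covers the same image region at every scale, (b) "scale-wise" attention mixes the tokens of different scales within one patch, and (c) a second, global attention step then mixes patches. The function names (`pool_to_grid`, `duo_attention_sketch`, `make_map`), the identity Q/K/V projections, the shared channel width, and the averaging used to fuse scales are all hypothetical simplifications.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head scaled dot-product attention. Identity Q/K/V projections
    # are a simplification; a real layer would learn them.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

def pool_to_grid(fmap, grid):
    # Average-pool a square [H][H][C] map into grid*grid tokens (row-major),
    # so token p covers the same image region at every scale (assumption).
    step = len(fmap) // grid
    channels = len(fmap[0][0])
    tokens = []
    for gy in range(grid):
        for gx in range(grid):
            cells = [fmap[y][x]
                     for y in range(gy * step, (gy + 1) * step)
                     for x in range(gx * step, (gx + 1) * step)]
            tokens.append([sum(c[k] for c in cells) / len(cells)
                           for k in range(channels)])
    return tokens

def duo_attention_sketch(feature_maps, grid):
    per_scale = [pool_to_grid(f, grid) for f in feature_maps]  # [S][P][C]
    fused = []
    for p in range(grid * grid):
        scale_tokens = [tokens[p] for tokens in per_scale]
        mixed = self_attention(scale_tokens)  # local: across scales, one patch
        # Fuse scales by averaging (hypothetical choice for this sketch).
        fused.append([sum(t[c] for t in mixed) / len(mixed)
                      for c in range(len(mixed[0]))])
    return self_attention(fused)  # global: across patches

def make_map(size, channels, seed):
    # Deterministic toy stand-in for one CNN stage's output.
    return [[[math.sin(seed + y * size + x + c) for c in range(channels)]
             for x in range(size)] for y in range(size)]

# Three toy "stages" at resolutions 8, 4, and 2, all with 4 channels.
maps = [make_map(8, 4, 0.0), make_map(4, 4, 1.0), make_map(2, 4, 2.0)]
out = duo_attention_sketch(maps, grid=2)
print(len(out), len(out[0]))
```

In a real model the stages of a CNN backbone would have different channel widths and would need a learned projection to a common dimension before attention; the sketch sidesteps that by giving every toy map the same width.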

Taxonomy

Topics: Infrared Target Detection Methodologies · Visual Attention and Saliency Detection