DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention

Xiaoya Tang; Bodong Zhang; Beatrice S. Knudsen; Tolga Tasdizen

arXiv:2407.13920·cs.CV·July 22, 2024

TL;DR

DuoFormer is a hierarchical transformer model that combines CNNs and Vision Transformers with a novel scale attention mechanism, improving medical image analysis especially on small datasets by capturing multi-scale features.

Contribution

The paper introduces a hierarchical transformer architecture with scale attention, integrating CNN features with a ViT for better spatial understanding and generalization.

Findings

1. Outperforms baseline models on small and medium-sized medical datasets
2. Demonstrates efficiency and generalizability across applications
3. Plug-and-play design for various CNN architectures

Abstract

We here propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are then adapted for transformer input through an innovative patch tokenization. We also introduce a 'scale attention' mechanism that captures cross-scale dependencies, complementing patch attention to enhance spatial understanding and preserve global perception. Our approach significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalizability. The components are designed as plug-and-play for…
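The abstract describes two complementary attention steps: scale attention across the hierarchical features at one spatial location, and patch attention across spatial locations. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes the CNN feature maps have already been tokenized into an array of shape (N, S, D), with N patch positions, S scales, and D channels. The function names `self_attention` and `duo_attention_sketch`, the single-head formulation, the ordering of the two steps, and the random projection weights are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens has shape (..., T, D); attention is computed over the T axis,
    independently for every leading index.
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def duo_attention_sketch(tokens, scale_params, patch_params):
    """Hypothetical two-step attention over tokens of shape (N, S, D).

    Step 1 (scale attention): for each patch position, attend across
    its S scale tokens.
    Step 2 (patch attention): for each scale, attend across the N
    patch positions.
    """
    # Scale attention: leading axis N is treated as a batch, T = S.
    scale_out = self_attention(tokens, *scale_params)       # (N, S, D)
    # Patch attention: move scales to the batch axis so that T = N.
    per_scale = np.swapaxes(scale_out, 0, 1)                # (S, N, D)
    patch_out = self_attention(per_scale, *patch_params)    # (S, N, D)
    return np.swapaxes(patch_out, 0, 1)                     # (N, S, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_patches, n_scales, dim = 49, 4, 32  # illustrative sizes only
    tokens = rng.standard_normal((n_patches, n_scales, dim))
    scale_params = [0.1 * rng.standard_normal((dim, dim)) for _ in range(3)]
    patch_params = [0.1 * rng.standard_normal((dim, dim)) for _ in range(3)]
    out = duo_attention_sketch(tokens, scale_params, patch_params)
    print(out.shape)
```

In this sketch the scale step mixes information only within a patch position, while the patch step mixes information only within a scale, which is one simple way to read "complementing patch attention" in the abstract. The paper itself should be consulted for the actual tokenization, head count, and how the two attention types are combined.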

Taxonomy

Topics: Visual Attention and Saliency Detection

Methods: Softmax · Attention Is All You Need