MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention

Zunhui Xia; Hongxing Li; Libin Lan

arXiv:2507.02488 · cs.CV · August 6, 2025

TL;DR

MedFormer is a hierarchical vision transformer for medical image recognition. It combines a pyramid structure with a content-aware dual sparse selection attention mechanism to improve efficiency and robustness across tasks.

Contribution

The paper introduces MedFormer, a general-purpose medical vision transformer that pairs a pyramid backbone with a dual sparse selection attention mechanism to improve both efficiency and performance.

Findings

01. Outperforms existing models in accuracy and efficiency

02. Effective across multiple medical imaging tasks

03. Reduces computational load while maintaining high performance

Abstract

Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense…
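The abstract contrasts full attention with sparse attention, but it is cut off before the paper's own mechanism is described. The sketch below is therefore not MedFormer's dual sparse selection attention. It is a generic, minimal illustration of the difference between full attention and a content-aware sparse variant in which each query keeps only its top-k highest-scoring keys. The function names (`full_attention`, `topk_sparse_attention`) and the parameter `k` are hypothetical and chosen for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, indices):
    """Scaled dot-product attention of one query over the keys at `indices`."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices))
            for d in range(dim)]


def full_attention(queries, keys, values):
    """Every query attends to every key: N_q * N_k weighted pairs."""
    all_idx = list(range(len(keys)))
    return [attend(q, keys, values, all_idx) for q in queries]


def topk_sparse_attention(queries, keys, values, k):
    """Illustrative content-aware sparsity (an assumption, not the paper's
    method): each query attends only to its k best-matching keys."""
    outputs = []
    for q in queries:
        scores = [dot(q, key) for key in keys]
        # The kept set depends on the content of q and the keys,
        # unlike a fixed (handcrafted) window or stride pattern.
        top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        outputs.append(attend(q, keys, values, top))
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(full_attention(queries, keys, values))
    print(topk_sparse_attention(queries, keys, values, k=1))
```

This toy version still scores every key in order to choose the top k, so it reduces only the softmax and value aggregation, not the selection cost itself. Practical sparse attention designs typically make the selection cheaper as well, for example by choosing among coarse groups of tokens first. Whether and how MedFormer does so is not stated in the truncated abstract above.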
