ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging
Athanasios Angelakis

TL;DR
ZACH-ViT is a compact, permutation-invariant Vision Transformer designed for medical imaging, demonstrating that removing positional embeddings and class tokens can be advantageous in data-scarce scenarios with weak spatial priors.
Contribution
This paper introduces ZACH-ViT, a novel transformer architecture that eliminates positional embeddings and class tokens and is tailored to medical imaging tasks with weak spatial priors.
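To illustrate why dropping positional embeddings and the class token yields permutation-invariant processing, here is a minimal NumPy sketch. It is not the ZACH-ViT implementation: the single-head `attention_block`, the `encode` helper, the weight shapes, and the patch count are all assumptions chosen for illustration. Self-attention without positional terms is permutation-equivariant, so mean pooling over patch tokens gives the same output for any patch order; adding a positional embedding breaks that property.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(tokens, w_q, w_k, w_v):
    # Single-head self-attention with a residual connection.
    # No positional term appears anywhere (illustrative assumption).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return tokens + softmax(scores) @ v

def encode(tokens, weights):
    # No [CLS] token: the image representation is the mean over patch tokens.
    return attention_block(tokens, *weights).mean(axis=0)

rng = np.random.default_rng(0)
dim, n_patches = 16, 49  # hypothetical sizes
weights = tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
patches = rng.normal(size=(n_patches, dim))
perm = rng.permutation(n_patches)

# Without positional embeddings, shuffling the patches leaves the output unchanged.
out_original = encode(patches, weights)
out_shuffled = encode(patches[perm], weights)

# With a positional embedding tied to the slot index, the output depends on order.
pos = rng.normal(size=(n_patches, dim))
out_pos_original = encode(patches + pos, weights)
out_pos_shuffled = encode(patches[perm] + pos, weights)

print(np.allclose(out_original, out_shuffled))
print(np.allclose(out_pos_original, out_pos_shuffled))
```

The same argument extends to stacked blocks, since each attention and per-token feed-forward layer preserves equivariance and only the final pooling collapses the patch axis.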
Findings
ZACH-ViT performs best on datasets with weak spatial priors like BloodMNIST.
Reintroducing positional support benefits datasets with stronger anatomical priors.
Removing the [CLS] token is consistently beneficial across datasets.
Abstract
Vision Transformers rely on positional embeddings and class tokens encoding fixed spatial priors. While effective for natural images, these priors may be suboptimal when spatial layout is weakly informative, a frequent condition in medical imaging. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes positional embeddings and the [CLS] token, achieving permutation-invariant patch processing via global average pooling. Zero-token denotes removal of the dedicated aggregation token and positional encodings. Patch tokens remain unchanged. Adaptive residual projections preserve training stability under strict parameter constraints. We evaluate ZACH-ViT across seven MedMNIST datasets under a strict few-shot protocol (50 samples/class, fixed hyperparameters, five seeds). Results reveal regime-dependent behavior: ZACH-ViT…
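The few-shot protocol mentioned above (50 samples per class, five seeds) could be set up roughly as follows. This is a sketch under stated assumptions, not the paper's code: the helper `few_shot_indices`, the synthetic 8-class label array, and the choice of seeds 0 to 4 are hypothetical, and the source does not say how the subsets were actually drawn.

```python
import numpy as np

def few_shot_indices(labels, per_class=50, seed=0):
    # Hypothetical helper: draw `per_class` indices from every class
    # without replacement, using a seed-specific generator.
    rng = np.random.default_rng(seed)
    picked = []
    for cls in np.unique(labels):
        candidates = np.flatnonzero(labels == cls)
        picked.append(rng.choice(candidates, size=per_class, replace=False))
    return np.sort(np.concatenate(picked))

# Synthetic stand-in for a training label vector (assumption: 8 classes).
labels = np.random.default_rng(123).integers(0, 8, size=4000)

# One balanced training subset per seed; hyperparameters would stay fixed across them.
subsets = {seed: few_shot_indices(labels, per_class=50, seed=seed) for seed in range(5)}

for seed, idx in subsets.items():
    print(seed, len(idx), np.bincount(labels[idx]))
```

Reporting the mean and spread of a metric across the five subsets gives a rough sense of how sensitive a model is to which 50 samples per class it happens to see.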
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
