AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards

David Smerkous; Zian Wang; Behzad Najafian

arXiv:2602.16249 · cs.CV · February 19, 2026

Open Access

TL;DR

AFFMAE introduces a scalable, efficient vision pretraining framework that enables high-resolution training on desktop GPUs by using adaptive token merging and optimized attention kernels, matching state-of-the-art performance with reduced computational costs.

Contribution

The paper presents AFFMAE, a hierarchical pretraining method that removes dense-grid assumptions, enabling efficient high-resolution vision model training on consumer hardware.

Findings

01

Matches ViT-MAE performance on electron microscopy segmentation

02

Reduces FLOPs by up to 7x compared to baseline

03

Halves memory usage and accelerates training on a single GPU

Abstract

Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We develop numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage…
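
The two ideas the abstract combines, keeping only visible tokens and merging exclusively over them, can be illustrated with a small sketch. This is not the authors' implementation: the function names (random_mask, merge_nearest_pairs), the 8x8 grid, the 75% mask ratio, and the nearest-neighbour averaging rule are all assumptions chosen for illustration, and the merging rule is a toy stand-in rather than AFFMAE's adaptive merging.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=0):
    """MAE-style random masking: return sorted indices of the visible tokens.

    Masked tokens are simply dropped, so later stages never see them.
    """
    rng = random.Random(seed)
    num_visible = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_visible))


def dist2(p, q):
    """Squared Euclidean distance between two 2D positions."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def merge_nearest_pairs(tokens, positions):
    """Toy merging stage (hypothetical, not the AFFMAE rule).

    Greedily averages each token with its nearest not-yet-merged neighbour.
    Merged positions are averages too, so they can fall off the patch grid.
    """
    unmerged = list(range(len(tokens)))
    merged_tokens, merged_positions = [], []
    while unmerged:
        i = unmerged.pop(0)
        if not unmerged:
            # Odd token count: the last token passes through unchanged.
            merged_tokens.append(tokens[i])
            merged_positions.append(positions[i])
            break
        j = min(unmerged, key=lambda k: dist2(positions[i], positions[k]))
        unmerged.remove(j)
        merged_tokens.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
        merged_positions.append(
            tuple((a + b) / 2 for a, b in zip(positions[i], positions[j]))
        )
    return merged_tokens, merged_positions


# Assumed setup: an 8x8 patch grid with 75% of the tokens masked.
grid = 8
visible = random_mask(grid * grid, mask_ratio=0.75)
positions = [divmod(idx, grid) for idx in visible]   # (row, col) of each visible patch
tokens = [[float(idx), 1.0] for idx in visible]      # dummy 2-dim features
merged_tokens, merged_positions = merge_nearest_pairs(tokens, positions)
print(len(visible), len(merged_tokens))  # prints "16 8"
```

Because merging operates on a variable-length list of (feature, position) pairs instead of a dense 2D grid, the set of visible tokens does not need to form a regular layout at any stage, which is the property the abstract describes as removing dense-grid assumptions.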

Taxonomy

Topics: Advanced Electron Microscopy Techniques and Applications · Advanced Neural Network Applications · Cell Image Analysis Techniques