Accelerating Vision Transformer Training via a Patch Sampling Schedule
Bradley McDanel, Chi Phuong Huynh

TL;DR
This paper proposes a Patch Sampling Schedule (PSS) for Vision Transformers that varies the number of patches used per batch during training. It shortens training time with minimal accuracy loss and makes the model more robust to patch sampling at inference.
Contribution
PSS changes how many patches are sampled per batch over the course of training, which shortens training with minimal accuracy loss and makes the trained model more robust to patch sampling at inference.
Findings
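For the pre-trained model, a 0.26% reduction in classification accuracy for a 31% reduction in training time (from 25 to 17 hours)
Greater robustness to a wider patch sampling range during inference, allowing a fine-grained trade-off between throughput and accuracy
Effective both for models trained from scratch and for pre-trained models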
Abstract
We introduce the notion of a Patch Sampling Schedule (PSS) that varies the number of Vision Transformer (ViT) patches used per batch during training. Since not all patches are equally important for most vision objectives (e.g., classification), we argue that less important patches can be used in fewer training iterations, leading to shorter training time with minimal impact on performance. Additionally, we observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference. This allows for a fine-grained, dynamic trade-off between throughput and accuracy during inference. We evaluate using PSSs on ViTs for ImageNet, both trained from scratch and pre-trained using a reconstruction loss function. For the pre-trained model, we achieve a 0.26% reduction in classification accuracy for a 31% reduction in training time (from 25 to 17 hours) compared to…
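The idea of varying the number of patches per batch can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the paper: the linear shape of the schedule, its start and end ratios, the function names linear_keep_ratio and sample_patch_indices, and above all the uniform random selection, which stands in for the importance-based choice of patches that the abstract describes.

```python
import random


def linear_keep_ratio(step, total_steps, start=0.5, end=1.0):
    """Hypothetical schedule: fraction of patches kept at a training step.

    Interpolates linearly from `start` to `end`; the schedule actually
    used in the paper is not specified in this summary.
    """
    if total_steps <= 1:
        return end
    frac = step / (total_steps - 1)
    return start + (end - start) * frac


def sample_patch_indices(num_patches, keep_ratio, rng):
    """Return a sorted subset of patch indices.

    Uses uniform random sampling as a placeholder; the abstract implies
    that patch importance guides the selection.
    """
    num_keep = max(1, min(num_patches, round(num_patches * keep_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


rng = random.Random(0)
num_patches = 196  # 14 x 14 grid: a 224 x 224 image split into 16 x 16 patches

for step in (0, 50, 99):
    ratio = linear_keep_ratio(step, total_steps=100)
    kept = sample_patch_indices(num_patches, ratio, rng)
    print(f"step {step}: keep ratio {ratio:.2f}, {len(kept)} of {num_patches} patches")
```

Because a transformer processes a variable-length token sequence, the kept indices could be used to gather only the selected patch embeddings before the encoder, so a batch with fewer patches costs less compute. The same sampling function could be called at inference with a chosen ratio to trade accuracy for throughput.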
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · COVID-19 diagnosis using AI
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Vision Transformer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout
