Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier
Aristeidis Tsaris, Chengming Zhang, Xiao Wang, Junqi Yin, Siyan Liu, Moetasim Ashfaq, Ming Fan, Jong Youl Choi, Mohamed Wahib, Dan Lu, Prasanna Balaprakash, Feiyi Wang

TL;DR
This paper introduces distributed sequence parallelism for Vision Transformers, enabling processing of up to 1 million tokens, and demonstrates significant improvements in climate modeling accuracy with large-scale models.
Contribution
Developed the first sequence parallelism method for ViTs, allowing training on extremely long sequences and scaling to 10B parameters with high efficiency.
Findings
Achieved 94% batch scaling efficiency on 2,048 GPUs.
Enabled training of models with 188K sequence length.
Improved climate prediction accuracy by 20%.
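To give a sense of why sequence lengths in the hundreds of thousands are hard to train on, the sketch below computes the token count of a ViT input from its image and patch size, and the memory a naively materialized attention score matrix would need. The image size, patch size, head count, and GPU count are hypothetical values chosen for illustration, not figures from the paper, and the helper names are our own.

```python
def vit_sequence_length(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens for a height x width image (class token excluded)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_scores_bytes(seq_len: int, num_heads: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to store one layer's full (heads, seq, seq) score matrix."""
    return num_heads * seq_len * seq_len * bytes_per_element


# Hypothetical setup: a 1024 x 1024 image with 2 x 2 patches.
tokens = vit_sequence_length(1024, 1024, 2)

# Hypothetical setup: 16 heads, 2-byte elements, sequence split over 8 GPUs.
naive_bytes = attention_scores_bytes(tokens, num_heads=16)
tokens_per_gpu = tokens // 8

print(f"tokens: {tokens}")
print(f"naive score matrix: {naive_bytes / 2**40:.1f} TiB")
print(f"tokens per GPU with 8-way sequence sharding: {tokens_per_gpu}")
```

The quadratic growth of the score matrix is what makes single-GPU training impractical at these lengths. Flash attention avoids materializing that matrix, and sequence parallelism divides the tokens across devices.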
Abstract
Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these techniques to ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, and tensor parallelism, together with flash attention strategies, to scale beyond single GPU memory…
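The sequence parallelism the abstract refers to can be illustrated with a minimal single-process sketch of the DeepSpeed-Ulysses idea. Each rank starts with a slice of the sequence for all heads, an all-to-all exchange gives each rank the full sequence for a subset of heads, attention runs locally, and a second all-to-all restores the sequence sharding. This is a NumPy simulation under stated assumptions, not the authors' implementation: the "ranks" are list entries, the all-to-all is emulated with concatenate and split, and all function names and tensor sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Plain scaled dot-product attention; inputs are (heads, seq, head_dim)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def seq_to_head_shards(shards):
    """Emulated all-to-all: (heads, seq/P, d) per rank -> (heads/P, seq, d) per rank."""
    full = np.concatenate(shards, axis=1)
    return np.split(full, len(shards), axis=0)


def head_to_seq_shards(shards):
    """Emulated inverse all-to-all: (heads/P, seq, d) per rank -> (heads, seq/P, d) per rank."""
    full = np.concatenate(shards, axis=0)
    return np.split(full, len(shards), axis=1)


def ulysses_style_attention(q_shards, k_shards, v_shards):
    # After the exchange, each rank holds every token but only some heads,
    # so attention over the full sequence needs no further communication.
    q_h = seq_to_head_shards(q_shards)
    k_h = seq_to_head_shards(k_shards)
    v_h = seq_to_head_shards(v_shards)
    out_h = [attention(q, k, v) for q, k, v in zip(q_h, k_h, v_h)]
    return head_to_seq_shards(out_h)


# Hypothetical sizes; heads and seq must both be divisible by the rank count.
rng = np.random.default_rng(0)
heads, seq, dim, ranks = 4, 16, 8, 4
q, k, v = (rng.standard_normal((heads, seq, dim)) for _ in range(3))

reference = attention(q, k, v)
out_shards = ulysses_style_attention(
    np.split(q, ranks, axis=1),
    np.split(k, ranks, axis=1),
    np.split(v, ranks, axis=1),
)
max_err = np.abs(np.concatenate(out_shards, axis=1) - reference).max()
print("max deviation from unsharded attention:", max_err)
```

Because the exchange partitions heads across ranks, this scheme assumes the number of heads is divisible by the sequence-parallel degree, which is one reason it is often combined with other forms of parallelism for larger models.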
Taxonomy
Topics: Image Processing Techniques and Applications · Optical Systems and Laser Technology · Optical measurement and interference techniques
