ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu

TL;DR
ByteTransformer introduces a padding-free, architecture-aware optimization for transformers that significantly accelerates variable-length input processing, outperforming existing frameworks on NVIDIA GPUs.
Contribution
The paper presents a novel padding-free algorithm and architecture optimizations for transformers, especially Multi-Head Attention, to improve performance on variable-length sequences.
Findings
Fused MHA outperforms PyTorch by 6.13x.
ByteTransformer surpasses state-of-the-art frameworks by up to 138%.
Optimization methods are applicable to various BERT-like models.
Abstract
Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models generate a commensurate need to accelerate performance. Natural language processing problems are also routinely faced with variable-length sequences, as word counts commonly vary among sentences. Existing deep learning frameworks pad variable-length sequences to a maximal length, which adds significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a padding-free algorithm that liberates the entire transformer from redundant computations on zero padded tokens. In addition to algorithmic-level optimization, we provide architecture-aware…
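To make the padding overhead described in the abstract concrete, here is a minimal sketch of one common way to drop padded tokens before token-wise computation and restore the padded layout afterwards. It is an illustration under stated assumptions, not the paper's implementation: the helper names `pack` and `unpack`, the 0/1 mask layout, and the use of a prefix sum over the mask to derive packed positions are all assumptions made for this example.

```python
from itertools import accumulate


def pack(batch, mask):
    """Drop padded positions; return the packed tokens in row-major order.

    `batch` and `mask` are lists of equal-length rows; mask[i][j] is 1 for a
    real token and 0 for padding (hypothetical layout for this sketch).
    """
    flat_tokens = [tok for row in batch for tok in row]
    flat_mask = [m for row in mask for m in row]
    # Inclusive prefix sum over the mask: for a valid token at flat index i,
    # prefix[i] - 1 is its destination index in the packed array.
    prefix = list(accumulate(flat_mask))
    packed = [None] * (prefix[-1] if prefix else 0)
    for i, (tok, m) in enumerate(zip(flat_tokens, flat_mask)):
        if m:
            packed[prefix[i] - 1] = tok
    return packed


def unpack(packed, mask, pad=0):
    """Scatter packed tokens back into the padded layout described by `mask`."""
    it = iter(packed)
    return [[next(it) if m else pad for m in row] for row in mask]


# Three sequences of lengths 3, 1, and 2, padded to a maximal length of 4.
batch = [[11, 12, 13, 0], [21, 0, 0, 0], [31, 32, 0, 0]]
mask = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]

packed = pack(batch, mask)
restored = unpack(packed, mask)
```

In this toy batch only 6 of the 12 padded slots hold real tokens, so any per-token operation applied to `packed` instead of `batch` touches half as many elements. Operations that mix tokens within a sequence, such as attention, still need the per-sequence boundaries, which this sketch does not address.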
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Topic Modeling
Methods: Attention Is All You Need · LAMB · DeBERTa · ALBERT · DistilBERT · Linear Layer · WordPiece · Weight Decay · Linear Warmup With Linear Decay
