ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs

Yujia Zhai; Chengquan Jiang; Leyuan Wang; Xiaoying Jia; Shang Zhang; Zizhong Chen; Xin Liu; Yibo Zhu

arXiv:2210.03052·cs.LG·February 21, 2023·1 cites

TL;DR

ByteTransformer introduces a padding-free, architecture-aware optimization for transformers that significantly accelerates variable-length input processing, outperforming existing frameworks on NVIDIA GPUs.

Contribution

The paper presents a padding-free algorithm together with architecture-aware optimizations for transformer modules, in particular multi-head attention (MHA), to improve performance on variable-length sequences.

Findings

01 Fused MHA outperforms PyTorch by 6.13x.

02 ByteTransformer surpasses state-of-the-art frameworks by up to 138%.

03 Optimization methods are applicable to various BERT-like models.

Abstract

Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models generate a commensurate need to accelerate performance. Natural language processing problems are also routinely faced with variable-length sequences, as word counts commonly vary among sentences. Existing deep learning frameworks pad variable-length sequences to a maximal length, which adds significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a padding-free algorithm that liberates the entire transformer from redundant computations on zero padded tokens. In addition to algorithmic-level optimization, we provide architecture-aware…
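To make the padding overhead described in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`pack`, `unpack`, `padding_fraction`) and the use of a prefix sum over sequence lengths to locate each sequence in the packed buffer are assumptions chosen for illustration. The idea shown is only that valid tokens can be stored back to back, with an offset table recording where each sequence starts, so that no work is spent on zero-padded positions.

```python
from itertools import accumulate


def pack(batch, lengths):
    """Drop padded positions from a [batch][max_len] grid of tokens.

    Returns the valid tokens stored contiguously, plus an offset table
    (a prefix sum of the lengths) marking where each sequence begins.
    """
    offsets = [0] + list(accumulate(lengths))
    packed = [tok for row, n in zip(batch, lengths) for tok in row[:n]]
    return packed, offsets


def unpack(packed, offsets, max_len, pad=0):
    """Rebuild the padded grid from the packed tokens and offsets."""
    rows = []
    for start, end in zip(offsets, offsets[1:]):
        row = packed[start:end]
        rows.append(row + [pad] * (max_len - len(row)))
    return rows


def padding_fraction(lengths, max_len):
    """Share of positions in the padded grid that hold no real token."""
    return 1 - sum(lengths) / (len(lengths) * max_len)


# Hypothetical batch: three sequences padded with 0 to max_len = 4.
batch = [[5, 6, 7, 0], [8, 0, 0, 0], [9, 10, 0, 0]]
lengths = [3, 1, 2]

packed, offsets = pack(batch, lengths)
# packed  == [5, 6, 7, 8, 9, 10]
# offsets == [0, 3, 4, 6]
# padding_fraction(lengths, 4) == 0.5, so half of the grid is padding
```

In this toy batch half of the padded positions carry no information, which is the kind of redundant memory and computation the abstract refers to. Any per-token operation (for example a linear layer) can run directly on the packed buffer, while operations that need per-sequence structure can use the offsets to find sequence boundaries.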

Code & Models

Repositories

bytedance/bytetransformer (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Topic Modeling

Methods: Attention Is All You Need · LAMB · DeBERTa · ALBERT · DistilBERT · Linear Layer · WordPiece · Weight Decay · Linear Warmup With Linear Decay