Alternating Updates for Efficient Transformers
Cenk Baykal, Dylan Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

TL;DR
AltUp is a simple method that increases transformer model capacity by widening token embeddings with minimal latency increase, enabling faster inference without sacrificing accuracy.
Contribution
The paper introduces Alternating Updates (AltUp), a novel technique to widen transformer representations efficiently by updating subblocks, compatible with existing models and methods.
Findings
Achieves up to 87% speedup relative to dense baselines at the same accuracy on the SuperGLUE and SQuAD benchmarks.
Maintains comparable accuracy while significantly reducing inference latency.
Effective across diverse transformer models and language tasks.
Abstract
It has been well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation, i.e., the token embedding, while only incurring a negligible increase in latency. AltUp achieves this by working on a subblock of the widened representation at each layer and using a predict-and-correct mechanism to update the inactivated blocks. We present extensions of AltUp, such as its applicability to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain…
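The predict-and-correct mechanism described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `altup_layer` and `altup_stack`, the mixing scalars `p`, the correction gains `g`, and the round-robin choice of the active block are all hypothetical choices made for the example.

```python
import numpy as np


def altup_layer(blocks, layer_fn, active, p, g):
    """One predict-compute-correct step over K sub-blocks (illustrative only).

    blocks:   array of shape (K, seq_len, d), the widened representation
              split into K sub-blocks of width d
    layer_fn: the ordinary width-d layer, mapping (seq_len, d) -> (seq_len, d)
    active:   index of the single sub-block that layer_fn is run on
    p:        (K, K) mixing scalars used for prediction (assumed parameters)
    g:        (K,) correction gains (assumed parameters)
    """
    # Predict: every sub-block is estimated as a cheap linear mix of all sub-blocks.
    predicted = np.einsum("ij,jsd->isd", p, blocks)
    # Compute: run the expensive layer on the active sub-block only.
    computed = layer_fn(blocks[active])
    # Correct: shift every prediction by the scaled error seen on the active sub-block.
    error = computed - predicted[active]
    return predicted + g[:, None, None] * error[None, :, :]


def altup_stack(blocks, layer_fns, p, g):
    """Apply the layers in order, alternating the active sub-block round-robin."""
    num_blocks = blocks.shape[0]
    for i, layer_fn in enumerate(layer_fns):
        blocks = altup_layer(blocks, layer_fn, i % num_blocks, p, g)
    return blocks
```

In this sketch each layer runs at the original width d, so the added work per layer is only the scalar mixing and correction. Setting `p` to the identity and `g` to all ones makes the active sub-block take exactly the layer's output while the other sub-blocks receive the same additive update; in practice these values would stand in for learned parameters.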
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Advanced Graph Neural Networks
