StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel
Dylan Cutler, Arun Kandoor, Nishanth Dikkala, Nikunj Saunshi, Xin Wang, Rina Panigrahy

TL;DR
StagFormer introduces a novel Transformer decoding architecture that staggers execution along the sequence axis, enabling parallel processing of layers to significantly speed up decoding without sacrificing quality.
Contribution
This work presents a new staggered decoding architecture for Transformers, allowing parallelization along the depth dimension and exploring extensions like weight-sharing and bounded window attention.
Findings
Potential decoding speedup with quality neutrality.
Effective weight-sharing for memory-limited settings.
Scalability of staggering over multiple sections.
Abstract
Decoding in a Transformer-based language model is inherently sequential, as a token's embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new architecture, StagFormer (Staggered Transformer), which staggers execution along the sequence axis and thereby enables parallelizing the decoding process along the depth of the model. We achieve this by breaking the dependency of the token representation at time step $i$ in layer $l$ upon the representations of tokens until time step $i$ from layer $l-1$. Instead, we stagger the execution and only allow a dependency on token representations until time step $i-1$. The later sections of the Transformer still get access to the "rich" representations from the prior section, but only from those token positions which are one time step behind. StagFormer allows for…
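The dependency change described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' implementation: it is an idealized toy model in which the network is split into a few sections, each section costs one "tick" per token, sections can run on separate workers at no communication cost, and the next token only becomes available once the last section has finished the previous position. The function name `decode_ticks` and its parameters are hypothetical, chosen for illustration only.

```python
def decode_ticks(num_tokens, num_sections, staggered):
    """Return the tick at which the last section finishes the last position.

    Toy model (all assumptions): every section costs one tick per token,
    and independent work runs fully in parallel.
    """
    finish = {}  # (section, position) -> tick at which that output is ready
    for i in range(num_tokens):
        # Token i is only known once the last section has finished position i-1.
        token_ready = finish.get((num_sections - 1, i - 1), 0)
        for s in range(num_sections):
            if s == 0:
                prev_ready = 0  # the first section has no earlier section
            elif staggered:
                # Staggered: section s at step i only reads section s-1
                # outputs up to step i-1.
                prev_ready = finish.get((s - 1, i - 1), 0)
            else:
                # Standard: section s at step i needs section s-1 at step i.
                prev_ready = finish[(s - 1, i)]
            finish[(s, i)] = 1 + max(token_ready, prev_ready)
    return finish[(num_sections - 1, num_tokens - 1)]


print(decode_ticks(8, 2, staggered=False))  # 16
print(decode_ticks(8, 2, staggered=True))   # 8
```

Under these assumptions the standard schedule needs one tick per section per token, while the staggered schedule needs one tick per token because all sections can work on the same step at once. Real speedups would be smaller, since the sketch ignores communication cost and any extra computation the staggered architecture introduces.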
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing
