TL;DR
SNLP is a layer-parallel training and inference framework for Transformers. It replaces exact Newton corrections with cheap structured surrogates and adds a regularizer, which lowers perplexity under layer-parallel execution and yields a practical inference speedup.
Contribution
The paper proposes Structured Newton Layer Parallelism (SNLP), a method that replaces exact layer Jacobians with cheap, architecture-induced surrogate dynamics so that Transformer layers can be evaluated in parallel at inference time.
Findings
SNLP improves layer-parallel compatibility and reduces perplexity by up to 23.4%.
On a 0.5B-parameter Nanochat model, SNLP achieves a 2.3x inference speedup.
SNLP regularization enhances the accuracy of structured Newton iterations, benefiting both training and inference.
Abstract
Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the…
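To make the "prefix-sum-like update" concrete, here is a minimal NumPy sketch of one way to read Identity Newton. It is an illustration under stated assumptions, not the paper's implementation: it assumes plain residual layers of the form h_{l+1} = h_l + f_l(h_l), uses toy tanh branches in place of real Transformer blocks, and all function names (`make_layers`, `sequential_forward`, `identity_newton_forward`) are hypothetical. With residual r_l = h_{l+1} - h_l - f_l(h_l), a Newton step solves delta_{l+1} - J_l delta_l = -r_l; replacing each layer Jacobian J_l with the identity gives delta_{l+1} = delta_l - r_l, which is a cumulative sum over layers.

```python
import numpy as np

def make_layers(num_layers, dim, seed=0, scale=0.1):
    """Toy residual branches f_l(h) = tanh(h @ W_l); a stand-in for real blocks."""
    rng = np.random.default_rng(seed)
    weights = [scale * rng.standard_normal((dim, dim)) for _ in range(num_layers)]
    return [lambda h, W=W: np.tanh(h @ W) for W in weights]

def sequential_forward(layers, h0):
    """Reference: apply h_{l+1} = h_l + f_l(h_l) one layer at a time."""
    trace = [h0]
    for f in layers:
        trace.append(trace[-1] + f(trace[-1]))
    return np.stack(trace)

def identity_newton_forward(layers, h0, num_iters):
    """Iterate on the whole hidden-state trace with an identity Jacobian surrogate."""
    num_layers = len(layers)
    # Initial guess: every layer's hidden state equals the input.
    trace = np.tile(np.asarray(h0, dtype=float), (num_layers + 1, 1))
    for _ in range(num_iters):
        # Each branch depends only on the current guess for its own input,
        # so these evaluations are independent (written as a loop for clarity).
        branch = np.stack([f(trace[l]) for l, f in enumerate(layers)])
        # Residual of the layerwise equation: r_l = h_{l+1} - h_l - f_l(h_l).
        residual = trace[1:] - trace[:-1] - branch
        # Identity surrogate: delta_{l+1} = delta_l - r_l with delta_0 = 0,
        # i.e. a negated prefix sum of the residuals.
        delta = -np.cumsum(residual, axis=0)
        trace[1:] = trace[1:] + delta
    return trace

if __name__ == "__main__":
    layers = make_layers(num_layers=6, dim=8)
    h0 = np.linspace(-1.0, 1.0, 8)
    exact = sequential_forward(layers, h0)
    for k in (1, 2, 3, 6):
        approx = identity_newton_forward(layers, h0, num_iters=k)
        print(k, np.max(np.abs(approx[-1] - exact[-1])))
```

In this sketch each sweep makes at least one more layer exact, so `num_layers` sweeps reproduce the sequential result up to floating-point error. Any latency gain therefore depends on far fewer sweeps being accurate enough, which is where the abstract's observation about unstable naive iterations on trained Transformers becomes relevant.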
