Gated Subspace Inference for Transformer Acceleration
Stephen J. Thomas

TL;DR
This paper introduces a low-rank subspace inference method for transformer models that accelerates inference by selectively skipping residual computations, with reported effective speedups of 3.0x to 10.5x on linear-layer weight reads and no retraining.
Contribution
It presents a novel subspace-based inference technique that exploits low effective rank in transformer activations for efficient acceleration without architectural changes.
Findings
Achieves effective speedups of 3.0x to 10.5x on linear-layer weight reads across GPT-2 124M, GPT-J 6B, and OPT 6.7B on AMD MI300X.
Maintains perplexity ratios below 1.00 and top-1 token agreement above 98%.
Produces identical outputs at certain operating points.
Abstract
A method is presented for accelerating inference in transformer language models by exploiting the low effective rank of the token activation manifold at each layer. The method decomposes each activation vector into a subspace component and a residual, computes the linear-layer output on the subspace component via a cached low-rank weight image at reduced memory bandwidth, and applies a per-token gate that determines whether the residual correction is computed or skipped. The gate ensures that the output distribution is preserved to within a controllable tolerance. Validation on three model families (GPT-2 124M, GPT-J 6B, OPT 6.7B) on AMD MI300X demonstrates effective speedups of 3.0x to 10.5x on linear-layer weight reads with perplexity ratios below 1.00 and top-1 token agreement above 98%. The method requires no retraining, no architectural modification, and no approximation of the…
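The decomposition and gate described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the SVD-based basis construction, the relative-residual-norm gate, and the names `build_subspace`, `gated_linear`, and `tau` are all hypothetical choices made here for clarity.

```python
import numpy as np

def build_subspace(acts, k):
    """Return an orthonormal (d, k) basis from calibration activations.

    Assumption: the basis is the top-k right singular vectors of the
    activation matrix; the source does not say how the subspace is chosen.
    """
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:k].T

def gated_linear(x, W, U, WU, tau):
    """Linear layer y = x @ W.T computed via a subspace part plus a gated residual.

    x:  (n, d) token activations
    W:  (out, d) full weight matrix
    U:  (d, k) orthonormal subspace basis
    WU: (out, k) cached low-rank weight image, W @ U
    tau: hypothetical gate threshold on the relative residual norm
    """
    c = x @ U                      # (n, k) subspace coefficients
    r = x - c @ U.T                # (n, d) residual outside the subspace
    y = c @ WU.T                   # cheap path: only the (out, k) image is read
    rel = np.linalg.norm(r, axis=1) / (np.linalg.norm(x, axis=1) + 1e-12)
    gate = rel > tau               # per-token decision: compute the correction?
    if gate.any():
        y[gate] += r[gate] @ W.T   # full-weight correction for gated tokens only
    return y, gate

# Synthetic demo: activations that lie close to a rank-8 subspace.
rng = np.random.default_rng(0)
n, d, out, k = 100, 64, 32, 8
basis = rng.standard_normal((k, d))
acts = rng.standard_normal((n, k)) @ basis + 0.01 * rng.standard_normal((n, d))
W = rng.standard_normal((out, d))

U = build_subspace(acts, k)
WU = W @ U                         # computed once and cached

y_exact, gate_all = gated_linear(acts, W, U, WU, tau=0.0)   # correct every token
y_skip, gate_none = gated_linear(acts, W, U, WU, tau=1.0)   # skip every token
```

When the gate fires for a token, the subspace part and the residual correction sum to the full product, so the output for that token matches the dense layer up to floating-point rounding. When it does not fire, only the smaller cached image `WU` is read, which is where a bandwidth saving would come from in this sketch.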
