Transformer tricks: Precomputing the first layer
Nils Graef

TL;DR
This paper introduces a technique to precompute a large portion of the first layer of transformer models with RoPE, which slightly reduces inference latency and cost-per-token. Because only one layer is optimized, the relative savings are largest for models with few layers.
Contribution
It presents a simple method to precompute a large portion of the first transformer layer, which lowers inference latency and cost-per-token. The fewer layers a model has, the larger the relative savings.
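The precomputation described above can be illustrated with a minimal sketch. It rests on assumptions that are not spelled out on this page: a pre-norm layer with RMSNorm, no positional embedding added to the token embeddings (RoPE is applied to Q and K later, per position), and toy sizes with random weights. The names `first_layer_qkv` and `qkv_table` are hypothetical. Under those assumptions, the inputs to the first layer's Q, K, and V projections depend only on the token ID, so the projections can be computed once per vocabulary entry and looked up at inference time.

```python
import random

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (assumed here for simplicity)."""
    scale = (sum(v * v for v in x) / len(x) + eps) ** -0.5
    return [v * scale for v in x]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
vocab_size, d_model = 8, 4  # toy sizes, purely illustrative

embedding = rand_matrix(vocab_size, d_model)
W_q, W_k, W_v = (rand_matrix(d_model, d_model) for _ in range(3))

def first_layer_qkv(token_id):
    """Position-independent part of the first layer: embed, normalize, project."""
    x = rms_norm(embedding[token_id])
    return matvec(x, W_q), matvec(x, W_k), matvec(x, W_v)

# Offline: compute Q, K, V once for every token in the vocabulary.
qkv_table = [first_layer_qkv(t) for t in range(vocab_size)]

# Online: a table lookup replaces the embedding, norm, and three projections.
tokens = [3, 1, 3, 7]
looked_up = [qkv_table[t] for t in tokens]
computed = [first_layer_qkv(t) for t in tokens]

# The lookup matches the on-the-fly computation exactly.
# RoPE would still be applied to Q and K afterwards, since it depends on position.
assert looked_up == computed
```

The trade-off in this sketch is memory for compute: the table stores three vectors per vocabulary entry instead of one embedding vector.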
Findings
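Precomputing a large portion of the first layer slightly reduces latency and cost-per-token.
Relative savings shrink as the number of layers grows: at most 25% for a 4-layer model (such as Whisper tiny) and at most 3% for a 32-layer model.
Applicable to transformers with RoPE, such as LLaMA, Mistral, PaLM, and Gemma.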
Abstract
This micro-paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, PaLM, and Gemma). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) is limited to 25%, while a 32-layer model is limited to 3% savings. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
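The two limits quoted in the abstract follow from simple arithmetic. The sketch below assumes every layer costs the same and that, in the best case, the entire first layer is eliminated, so the bound is one layer out of the total. The function name `max_relative_savings` is hypothetical.

```python
def max_relative_savings(num_layers: int) -> float:
    """Upper bound on relative savings when only the first of num_layers
    equally expensive layers is precomputed (an assumption of this sketch)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    return 1.0 / num_layers

for n in (4, 32):
    print(f"{n} layers: at most {max_relative_savings(n):.1%} savings")
```

For 4 layers the bound is 1/4, the 25% quoted above; for 32 layers it is 1/32, which the abstract rounds to 3%.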
