Transformer tricks: Precomputing the first layer

Nils Graef

arXiv:2402.13388 · cs.LG · March 13, 2024 · 1 citation

TL;DR

This paper introduces a technique to precompute a large portion of the first layer of transformer models that use RoPE, which slightly reduces inference latency and cost-per-token. Because only one layer is optimized, the relative savings are largest for models with few layers.

Contribution

It presents a simple method to precompute a large portion of the first transformer layer in RoPE-based models, giving slightly lower latency and lower cost-per-token at inference.

Findings

01

Precomputing the first layer reduces latency and cost.

02

Maximum savings shrink as the number of layers grows: up to 25% for a 4-layer model (such as Whisper tiny), but only about 3% for a 32-layer model.

03

Applicable to transformers with RoPE, such as LLaMA, Mistral, PaLM, and Gemma.
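
The layer-count dependence in the findings above can be illustrated with a short calculation. This is only a sketch of the upper bound: it assumes every layer costs the same and that the first layer is removed entirely, and it ignores the embedding lookup and output head. The function name `max_savings` is hypothetical and does not come from the paper or its repository.

```python
def max_savings(num_layers: int) -> float:
    """Upper bound on the fraction of per-token compute saved when one of
    `num_layers` equally expensive layers is skipped entirely (assumption)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    return 1.0 / num_layers


# The two layer counts used as examples in the abstract.
for n in (4, 32):
    print(f"{n} layers: at most {max_savings(n):.1%}")
# prints "4 layers: at most 25.0%" and "32 layers: at most 3.1%"
```

Since only part of the first layer can be precomputed, real savings would sit below this bound.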

Abstract

This micro-paper describes a trick to speed up inference of transformers with RoPE (such as LLaMA, Mistral, PaLM, and Gemma). For these models, a large portion of the first transformer layer can be precomputed, which results in slightly lower latency and lower cost-per-token. Because this trick optimizes only one layer, the relative savings depend on the total number of layers. For example, the maximum savings for a model with only 4 layers (such as Whisper tiny) is limited to 25%, while a 32-layer model is limited to 3% savings. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
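
One way to picture the trick, as a rough sketch rather than the author's implementation: in a RoPE model no positional embedding is added to the token embedding, so the input to the first layer depends only on the token id. The linear projections of that input can therefore be computed once per vocabulary entry and stored in a lookup table. The toy below uses pure Python and tiny random matrices. All names (`first_layer_qkv`, `qkv_table`, `VOCAB`, `DIM`) are hypothetical, and normalization and RoPE itself are omitted. RoPE would still be applied to the query and key at inference time because it depends on position.

```python
import random


def matvec(vec, mat):
    """Multiply a row vector (length d) by a d x e matrix, returning length e."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def rand_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VOCAB, DIM = 8, 4  # toy sizes, chosen only for illustration
embedding = rand_matrix(VOCAB, DIM, rng)
w_q, w_k, w_v = (rand_matrix(DIM, DIM, rng) for _ in range(3))


def first_layer_qkv(token_id):
    """Baseline: embedding lookup followed by three projections per token."""
    x = embedding[token_id]
    return matvec(x, w_q), matvec(x, w_k), matvec(x, w_v)


# Offline step: run the projections once for every vocabulary entry.
qkv_table = [first_layer_qkv(t) for t in range(VOCAB)]


def first_layer_qkv_precomputed(token_id):
    """Inference: a single table lookup replaces the three projections."""
    return qkv_table[token_id]


tokens = [3, 1, 3, 7]
assert all(first_layer_qkv(t) == first_layer_qkv_precomputed(t) for t in tokens)
```

The table in this sketch holds one row of query, key, and value vectors per vocabulary entry, so it trades extra memory for the skipped matrix multiplications.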

Code & Models

Repositories

openmachine-ai/transformer-tricks (PyTorch, official)
