LLM Architecture, Scaling Laws, and Economics: A Quick Summary
William H. Press

TL;DR
This paper concisely summarizes the standard Transformer-based LLM architecture, the scaling laws for compute and memory, and rough 2025 cost estimates for LLMs at various scales. It presents no new research findings.
Contribution
It gives a condensed overview of the current standard LLM architecture, scaling laws, and cost estimates, material that the author notes is not otherwise readily available in summary form.
Findings
Transformer architecture details summarized
Scaling laws for compute and memory provided
Cost estimates for various LLM scales discussed
Abstract
The current standard architecture of Large Language Models (LLMs) with QKV self-attention is briefly summarized, including the architecture of a typical Transformer. Scaling laws for compute (flops) and memory (parameters plus data) are given, along with their present (2025) rough cost estimates for the parameters of present LLMs of various scales, including discussion of whether DeepSeek should be viewed as a special case. Nothing here is new, but this material seems not otherwise readily available in summary form.
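As a rough illustration of the kind of compute and cost arithmetic the abstract refers to, the sketch below uses the widely cited rule of thumb that training compute is about 6 × parameters × tokens. That rule is an assumption made here for illustration; the paper's own formulas and constants may differ. The function names, hardware throughput, utilization, and price are hypothetical placeholders, not figures from the paper.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs.

    Assumes the common C ~ 6 * N * D rule of thumb (forward plus
    backward pass); this is not taken from the paper itself.
    """
    return 6.0 * n_params * n_tokens


def training_cost_usd(flops: float,
                      peak_flops_per_sec: float,
                      utilization: float,
                      usd_per_device_hour: float) -> float:
    """Convert a FLOP budget into a dollar estimate.

    All hardware and pricing inputs are hypothetical and supplied by the caller.
    """
    device_seconds = flops / (peak_flops_per_sec * utilization)
    return device_seconds / 3600.0 * usd_per_device_hour


if __name__ == "__main__":
    # Hypothetical model: 7e9 parameters trained on 1.4e11 tokens.
    flops = training_flops(7e9, 1.4e11)
    # Hypothetical accelerator: 1e15 peak FLOP/s, 40% utilization, $2 per hour.
    cost = training_cost_usd(flops, 1e15, 0.4, 2.0)
    print(f"compute: {flops:.3e} FLOPs, estimated cost: ${cost:,.0f}")
```

Keeping the hardware and price inputs as explicit arguments makes it easy to see how sensitive the dollar figure is to each assumption.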
Taxonomy
Topics: Computational and Text Analysis Methods · Machine Learning in Materials Science · Text Readability and Simplification
