TL;DR
This paper explores how architectural choices in large language models affect inference efficiency and accuracy, proposing a scaling law and search framework validated by training over 200 models.
Contribution
It introduces a conditional scaling law incorporating architecture details and a search framework for designing inference-efficient, accurate LLMs.
Findings
The conditional scaling law accurately predicts optimal architectures.
Optimized models outperform open-source baselines in accuracy and throughput.
Models designed with the proposed framework achieve up to 2.1% higher accuracy and 42% higher inference throughput than the baselines.
Abstract
Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, namely hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models…
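To make the three architectural factors named in the abstract concrete, the following is a minimal sketch of how they might be tallied for a single transformer layer. It is not the paper's code. The class name `LayerConfig`, its fields, and the example sizes are hypothetical, and the sketch assumes a gated three-matrix MLP, no bias terms, and a ratio defined as MLP parameters divided by attention parameters; the paper's exact definition may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical per-layer configuration (names are illustrative only)."""
    hidden_size: int        # model width
    n_heads: int            # number of query heads
    n_kv_heads: int         # number of key/value heads (GQA when < n_heads)
    intermediate_size: int  # MLP inner width

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.n_heads

    def attention_params(self) -> int:
        # Query and output projections are d x d; the key and value
        # projections shrink with the number of KV heads.
        d = self.hidden_size
        kv_dim = self.n_kv_heads * self.head_dim
        return 2 * d * d + 2 * d * kv_dim

    def mlp_params(self) -> int:
        # Assumption: gated MLP with three weight matrices, no biases.
        return 3 * self.hidden_size * self.intermediate_size

    def mlp_to_attention_ratio(self) -> float:
        # Assumption: ratio of parameter counts, MLP over attention.
        return self.mlp_params() / self.attention_params()

    def kv_cache_elements_per_token(self) -> int:
        # One key vector and one value vector per KV head, for this layer.
        return 2 * self.n_kv_heads * self.head_dim


# Same width and MLP size; only the number of KV heads differs.
mha = LayerConfig(hidden_size=2048, n_heads=32, n_kv_heads=32, intermediate_size=8192)
gqa = LayerConfig(hidden_size=2048, n_heads=32, n_kv_heads=8, intermediate_size=8192)

for name, cfg in (("MHA", mha), ("GQA", gqa)):
    print(
        name,
        "attention params:", cfg.attention_params(),
        "mlp/attention:", round(cfg.mlp_to_attention_ratio(), 2),
        "KV cache elements per token:", cfg.kv_cache_elements_per_token(),
    )
```

In this toy comparison, cutting the KV heads from 32 to 8 shrinks the per-token KV cache by a factor of four and raises the MLP-to-attention parameter ratio, which illustrates why these factors interact when trading accuracy against inference cost.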