Scaling Inference-Efficient Language Models

Song Bian; Minghao Yan; Shivaram Venkataraman

arXiv:2501.18107 · cs.LG · June 10, 2025

TL;DR

This paper introduces inference-aware scaling laws for large language models, optimizing model architecture and training to improve inference efficiency without sacrificing accuracy, demonstrated by the Morph-1B model.

Contribution

It extends existing scaling laws to include inference costs, proposing a method to train inference-efficient models and releasing a new model with improved latency and maintained accuracy.

Findings

01

Model architecture significantly impacts inference latency.

02

Wider and shallower models can be more efficient while preserving accuracy.

03

The Morph-1B model achieves 1.8x faster inference with comparable performance.
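To make the "same size, different shape" idea concrete, here is a minimal sketch that compares a deep, narrow configuration with a wide, shallow one. It assumes the common rule of thumb that a decoder-only transformer has roughly 12 * n_layers * d_model^2 non-embedding parameters (feed-forward width of 4 * d_model). The function names and the two configurations are hypothetical illustrations, not shapes taken from the paper.

```python
def approx_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count, assuming 12 * n_layers * d_model^2.

    This is a widely used approximation, not the paper's exact accounting.
    """
    return 12 * n_layers * d_model ** 2


def aspect_ratio(n_layers: int, d_model: int) -> float:
    """Width-to-depth ratio; larger values mean a wider, shallower model."""
    return d_model / n_layers


# Two hypothetical shapes with the same approximate parameter budget.
deep_params = approx_params(n_layers=32, d_model=1024)
wide_params = approx_params(n_layers=8, d_model=2048)

deep_ratio = aspect_ratio(n_layers=32, d_model=1024)
wide_ratio = aspect_ratio(n_layers=8, d_model=2048)

print(deep_params, wide_params)  # identical under this approximation
print(deep_ratio, wide_ratio)
```

Under this approximation both shapes have the same parameter count, yet the wide model runs a quarter as many sequential layers per token, which is one plausible reason shape can affect latency at a fixed size.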

Abstract

Scaling laws are powerful tools for predicting the performance of large language models. However, current scaling laws do not account for inference costs. In this work, we first show that model architecture affects inference latency: models of the same size can differ in latency by up to 3.5x. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Because models with similar training loss can still show gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training 63 models. Guided by our…
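For context, the baseline that the abstract says it modifies can be sketched as follows. This is the original Chinchilla parametric loss, L(N, D) = E + A / N^alpha + B / D^beta, with the constants as published for Chinchilla; it is not the paper's inference-aware law, whose architecture-dependent form is not given in the text above. The function name and the two sample points (taken from the ends of the ranges in the abstract) are illustrative assumptions.

```python
def chinchilla_loss(
    n_params: float,
    n_tokens: float,
    E: float = 1.69,      # irreducible loss (published Chinchilla fit)
    A: float = 406.4,     # parameter-term coefficient
    B: float = 410.7,     # data-term coefficient
    alpha: float = 0.34,  # parameter-term exponent
    beta: float = 0.28,   # data-term exponent
) -> float:
    """Original Chinchilla loss estimate; it has no term for model shape."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Smallest and largest settings mentioned in the abstract.
small = chinchilla_loss(n_params=80e6, n_tokens=1.6e9)
large = chinchilla_loss(n_params=1e9, n_tokens=30e9)

print(f"80M params, 1.6B tokens: {small:.3f}")
print(f"1B params, 30B tokens:  {large:.3f}")
```

Because this form depends only on parameter count and token count, it predicts the same loss for every model shape at a given size, which is the gap the abstract describes.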

Code & Models

Models

🤗 NaiveUser/morph-1b (model)

Videos

Scaling Inference-Efficient Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Chinchilla