Architectural Trade-offs in Small Language Models Under Compute Constraints

Shivraj Singh Bhatti

arXiv:2512.20877 · cs.CL · December 25, 2025

TL;DR

This paper systematically investigates how architectural choices and training budgets affect the performance of small language models under compute constraints, revealing efficiency trade-offs and the limited transferability of techniques from large models.

Contribution

The paper provides an empirical analysis of small language model architectures, comparing their efficiency and performance, and evaluates whether techniques such as RoPE transfer to small-scale models.
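For readers unfamiliar with RoPE, the following is a minimal sketch of rotary positional embeddings applied to a single vector. It is not taken from the paper: the function name `rope`, the interleaved pairing of dimensions, and the base of 10000 are assumptions based on the commonly described formulation, and real implementations often operate on batched tensors and may pair dimensions differently.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) of a vector by a
    position-dependent angle. Hypothetical helper for illustration only."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d), so later pairs rotate more slowly.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))

# Illustrative query/key vectors (arbitrary values, not from the paper).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset between positions:
# both pairs below are 2 positions apart.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 5), rope(k, 3))
```

Because each pair is rotated rather than shifted, the vector norm is unchanged and the query-key dot product is a function of the position difference alone, which is the property that makes the scheme a relative positional encoding.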

Findings

01. Attention-based models outperform MLPs per FLOP even at small scale.

02. Increasing depth or context without proper optimization can harm performance.

03. Techniques successful in large models do not always transfer well to small models.

Abstract

We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural…
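The three comparison axes named in the abstract (test NLL, parameter count, approximate training FLOPs) can be illustrated with a short sketch. The abstract does not say how FLOPs were approximated, so the `6 * parameters * tokens` rule used here is a common heuristic chosen as an assumption, not the paper's method. All function names are hypothetical, and the model entries are made-up placeholders rather than reported results.

```python
import math

def approx_training_flops(n_params, tokens_seen):
    """Heuristic estimate: about 6 FLOPs per parameter per training token
    (roughly 2 for the forward pass and 4 for the backward pass).
    An assumption for illustration, not the paper's accounting."""
    return 6.0 * n_params * tokens_seen

def mean_nll(target_probs):
    """Average negative log-likelihood in nats per token, given the
    probability the model assigned to each correct next token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def pareto_front(models):
    """Return names of models not dominated on (flops, nll).
    A model is dominated if another is no worse on both axes
    and strictly better on at least one."""
    front = []
    for name, flops, nll in models:
        dominated = any(
            f <= flops and l <= nll and (f < flops or l < nll)
            for _, f, l in models
        )
        if not dominated:
            front.append(name)
    return front

# Placeholder entries: (name, training FLOPs, test NLL). Not real results.
candidates = [
    ("model_a", 1e12, 2.0),
    ("model_b", 2e12, 1.8),
    ("model_c", 2e12, 2.1),
]
efficient = pareto_front(candidates)
```

Keeping only the non-dominated models is one simple way to read an accuracy-efficiency trade-off: a model stays on the front only if no other model reaches equal or lower NLL for equal or fewer FLOPs.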

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications