Architectural Trade-offs in Small Language Models Under Compute Constraints
Shivraj Singh Bhatti

TL;DR
This paper systematically investigates how architectural choices and training budgets affect the performance of small language models under compute constraints, revealing efficiency trade-offs and the limited transferability of techniques from large models.
Contribution
It provides an empirical analysis of small language model architectures, comparing their efficiency and performance, and evaluates the transferability of techniques like RoPE to small-scale models.
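For readers unfamiliar with RoPE, the following is a minimal pure-Python sketch of rotary positional embeddings as they are commonly formulated, not the paper's implementation. The function names `rope` and `dot`, the interleaved pairing of dimensions, and the default `base=10000.0` are assumptions made for illustration. It shows the property that motivates the technique: after rotation, the query-key dot product depends only on the relative offset between positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    Illustrative sketch: pair j (dimensions 2j and 2j+1) is rotated by
    position * base ** (-2j / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        # i == 2j, so base ** (-i / d) equals base ** (-2j / d)
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors, chosen only for the demonstration.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions have the same offset (2), so the scores agree.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 5), rope(k, 3))
```

Because each pair is only rotated, vector norms are unchanged, and position 0 leaves the vector as it was. Some implementations pair dimension `i` with `i + d/2` instead of interleaving; the relative-position property holds either way.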
Findings
Attention-based models outperform MLPs per FLOP even at small scale.
Increasing depth or context length without sufficient optimization can degrade performance.
Techniques successful in large models do not always transfer well to small models.
Abstract
We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural…
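The comparison metrics named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the helper names `mean_nll` and `approx_training_flops` are hypothetical, and the FLOP estimate uses the common rule of thumb of roughly 6 FLOPs per parameter per training token, which the source does not confirm as the approximation it used.

```python
import math

def mean_nll(target_probs):
    """Average negative log-likelihood in nats per token.

    `target_probs` holds the probability the model assigned to each
    true next token in the evaluation set.
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def approx_training_flops(n_params, n_tokens):
    """Rough training cost: about 6 FLOPs per parameter per token.

    Assumption: 2 FLOPs per parameter for the forward pass and about
    4 for the backward pass; attention-specific terms are ignored.
    """
    return 6 * n_params * n_tokens

# A model that guesses uniformly over a hypothetical 65-symbol
# character vocabulary scores log(65) nats per character.
uniform_nll = mean_nll([1 / 65] * 100)

# Hypothetical budget: 1M parameters trained on 10M tokens.
flops = approx_training_flops(1_000_000, 10_000_000)
```

Plotting test NLL against an estimate like `flops` for each architecture is one way to read off per-FLOP efficiency; the uniform-guess NLL gives a baseline any trained model should beat.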
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
