Tiny language models
Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter

TL;DR
This study investigates tiny language models (TLMs) and shows that they exhibit key qualitative features of larger models, with pre-training significantly improving classification performance. It also introduces methods for efficient, low-latency TLMs.
Contribution
The paper shows that pre-trained tiny language models retain essential NLP capabilities and introduces a soft committee approach for low-latency inference.
Findings
Pre-trained TLMs outperform non-pre-trained models on classification tasks.
The performance gap between pre-trained and non-pre-trained TLMs widens with larger pre-training datasets and greater token overlap.
An ensemble of shallow models can replicate the accuracy of a deep TLM.
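The ensemble finding can be illustrated with a minimal sketch of a soft committee. It assumes that "soft committee" means averaging the class probabilities of several independently trained shallow models; the summary does not give the paper's exact procedure, so the function names (`softmax`, `soft_committee`, `predict`) and the example logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_committee(member_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average the class probabilities produced by each committee member.

    Assumption: every member scores the same input over the same classes.
    """
    probs = [softmax(logits) for logits in member_logits]
    n_members = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_members for c in range(n_classes)]


def predict(member_logits: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    averaged = soft_committee(member_logits)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical logits from three shallow models for one input, three classes.
example_logits = [
    [2.0, 1.0, 0.1],
    [0.5, 1.5, 0.2],
    [1.8, 0.9, 0.3],
]
committee_probs = soft_committee(example_logits)
committee_class = predict(example_logits)
```

In this toy input the second member prefers class 1, but the averaged probabilities favor class 0. Because the members do not depend on one another, they could in principle be evaluated in parallel, which is one plausible reading of how a committee of shallow models would reduce latency relative to a single deep model.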
Abstract
A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on complex feedforward transformer block architectures pre-trained on large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features of LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training, even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
