Need a Small Specialized Language Model? Plan Early!
David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

TL;DR
This paper presents methods for efficiently creating specialized small language models from large pretraining sets, using importance sampling and a novel projected network architecture, to achieve good performance with limited data and budgets.
Contribution
It introduces importance sampling for domain-specific pretraining and a new projected network architecture for efficient model adaptation, advancing small model specialization techniques.
Findings
Importance sampling effectively mimics specialized data for small models.
Projected networks enable efficient adaptation of large models to small, specialized networks.
Both methods show empirical success across various domains and training constraints.
Abstract
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose…
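The importance sampling step mentioned in the abstract (resampling the generic pretraining set so that it imitates the specialization data) can be illustrated with a small sketch. The abstract does not specify how the sampling weights are obtained, so the version below rests on an assumption: every document has already been assigned to a discrete cluster, and each cluster's weight is the ratio of its frequency in the specialization data to its frequency in the generic data. The function name `importance_resample` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def importance_resample(generic_clusters, specialist_clusters, n_clusters, n_samples, rng):
    """Sample indices of generic documents so that, in expectation, their
    cluster histogram matches that of the specialist set.

    Assumption (not from the source): documents are pre-assigned to clusters,
    and weights are specialist frequency divided by generic frequency.
    """
    generic_clusters = np.asarray(generic_clusters)
    specialist_clusters = np.asarray(specialist_clusters)

    # Empirical cluster frequencies in each set.
    p_gen = np.bincount(generic_clusters, minlength=n_clusters) / len(generic_clusters)
    p_spec = np.bincount(specialist_clusters, minlength=n_clusters) / len(specialist_clusters)

    # Per-cluster importance weight; clusters absent from the generic set get 0
    # because there is nothing to sample from.
    weights = np.divide(p_spec, p_gen, out=np.zeros(n_clusters), where=p_gen > 0)

    # Per-document sampling probability, normalized to sum to 1.
    doc_probs = weights[generic_clusters]
    total = doc_probs.sum()
    if total == 0:
        raise ValueError("generic and specialist sets share no clusters")
    doc_probs = doc_probs / total

    return rng.choice(len(generic_clusters), size=n_samples, replace=True, p=doc_probs)


# Toy usage: a generic set spread uniformly over 4 clusters, and a specialist
# set concentrated on clusters 2 and 3.
rng = np.random.default_rng(0)
generic = rng.integers(0, 4, size=10_000)
specialist = np.array([2, 2, 2, 3])
idx = importance_resample(generic, specialist, n_clusters=4, n_samples=5_000, rng=rng)
resampled_hist = np.bincount(generic[idx], minlength=4) / len(idx)
```

In this toy setup the resampled histogram concentrates on clusters 2 and 3 in roughly a 3:1 ratio, mirroring the specialist set. A small model would then be pretrained on the resampled documents rather than on the original generic mix.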
Peer Reviews
Decision: Submitted to ICLR 2025
* The work addresses a relevant and applicable question around the effective deployment of LLMs, possibly via their pruned or distilled variants.
* The thorough record of the setup (e.g., number of GPU hours, training steps, parameter count) makes the case for SLM methods convincing and highlights the gains that can be reaped from the proposed methods.
* There is a lack of theoretical or heuristic justification for why linear projections are a suitable operator for deriving SLMs from LLMs. Some exploratory work on, e.g., the activation patterns of the LLMs across different domains that justify the linear projections would have been nice.
* A more direct comparison to other baselines such as LoRA would have made the evaluations more informative. Appendix F does include a comparison, but a direct reference in the main text would make it even more
The proposed projected network method is novel and the experiment section is comprehensive.
The paper is not well-written, and some details are difficult to understand. For example:
1) Section 2.4 has many equations, but the symbols used in these equations are not defined. This makes it difficult to understand how to get the weight for importance sampling.
2) Figure 4 is a bit misleading. The perplexity for SLM-is is a flat line, but it doesn't reflect the fact that SLM-is requires significantly more compute during specialization compared to other methods.
3) The training budget is def
+ The studied problem is very meaningful.
+ I personally like the method of SLM-is. The idea is novel and reasonable. The performance is also good.
+ It is unclear how important the design choice of using PN is for SLM-pn. Naturally, one could consider using an Adapter/LoRA/any possible PEFT module to adapt a large model to a specific domain. It would be better to show why taking PN is better, in terms of either performance or efficiency.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Sparse Evolutionary Training
