HARE: HumAn pRiors, a key to small language model Efficiency
Lingyun Zhang, Bin Jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu

TL;DR
This paper introduces HARE, a method leveraging human priors to construct concise, high-quality training data for small language models, improving efficiency and performance in resource-limited settings.
Contribution
It proposes a principle for incorporating human priors into data construction and demonstrates its effectiveness with the HARE-1.1B model, outperforming existing small language models.
Findings
HARE-1.1B achieves competitive results on benchmark datasets.
Using human priors enhances training efficiency in resource-constrained environments.
The principle guides effective data construction for small language models.
Abstract
Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM…
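One part of the principle above, avoiding benchmark data leakage, can be illustrated with a generic n-gram overlap filter. This is a minimal sketch, not the authors' pipeline: the abstract as shown here does not say how leakage is detected, so the whitespace tokenization, the n-gram length, and the function names `ngrams`, `build_benchmark_index`, and `filter_leaked` are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark example."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def filter_leaked(train_docs: Iterable[str], benchmark_index: Set[NGram], n: int) -> List[str]:
    """Keep only training documents that share no n-gram with the benchmark index."""
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(benchmark_index)]


# Hypothetical toy data, for illustration only.
benchmark = ["What is the capital of France? Paris"]
corpus = [
    "Trivia: what is the capital of France? Paris is the answer.",
    "Rabbits and hares are different species.",
]
index = build_benchmark_index(benchmark, n=5)
clean = filter_leaked(corpus, index, n=5)
```

In this toy run the first document is dropped because it shares a 5-gram with the benchmark item, and the second is kept. A real pipeline would likely use a proper tokenizer and a tuned n-gram length; those choices are outside what this page states.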
Taxonomy
Topics: Topic Modeling
