TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment
Ke Yang, Volodymyr Kindratenko, ChengXiang Zhai

TL;DR
This paper introduces a simplified, noise-reduced language environment and datasets for training and evaluating tiny language models, improving learning efficiency and instruction-following performance while reducing resource requirements.
Contribution
The paper proposes a data refinement pipeline that produces leaner datasets, improving the training efficiency of tiny language models and supporting their evaluation in simplified language environments.
Findings
Leaner datasets improve tiny LM instruction-following performance.
Pretraining on leaner datasets enhances learning efficiency.
Alignment with large LM datasets enables resource-efficient analysis.
Abstract
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
