PCMind-2.1-Kaiyuan-2B Technical Report
Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu, Yunyi Chen, Chenyi Dang, Hengtao Tao, Hui Wang, Fangming Liu, Kaifeng Lyu, Wenguang Chen

TL;DR
This paper introduces PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter LLM that targets training efficiency and effectiveness under resource constraints through quantile data benchmarking, strategic selective repetition of high-quality data, and multi-domain curriculum training.
Contribution
The report presents methods for data benchmarking, selective repetition of high-quality samples, and curriculum training for resource-limited LLM pretraining, and releases the models and tools as open source.
Findings
Competitive performance with state-of-the-art open-source models
Effective data mixing and training strategies for resource-limited settings
Open-source release facilitates broader research and application
Abstract
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and…
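The abstract names selective repetition of high-quality data and a curriculum that orders samples by quality, but gives no implementation details. The Python sketch below is one plausible reading, not the authors' code. It assumes each sample already carries a scalar quality score (higher is better), that the curriculum runs from low to high quality, and that domains are interleaved by normalized rank so the domain mix stays roughly constant over training. The function names, the `top_fraction` and `repeats` parameters, and the interleaving rule are all hypothetical.

```python
from typing import Dict, List, Tuple

# (text, quality score); the scoring method is an assumption, not from the report.
Sample = Tuple[str, float]


def repeat_top_quantile(samples: List[Sample], top_fraction: float,
                        repeats: int) -> List[Sample]:
    """Return the samples sorted by ascending quality, with the top
    `top_fraction` of them appearing `repeats` times each (hypothetical rule)."""
    ranked = sorted(samples, key=lambda s: s[1])
    n_top = int(len(ranked) * top_fraction)
    cutoff = len(ranked) - n_top
    repeated = [s for s in ranked[cutoff:] for _ in range(repeats)]
    return ranked[:cutoff] + repeated


def curriculum_order(domains: Dict[str, List[Sample]]) -> List[Tuple[str, str]]:
    """Order samples from low to high quality within each domain, then
    interleave domains by normalized rank so each domain is spread evenly
    across the whole schedule."""
    keyed = []
    for name, samples in domains.items():
        ranked = sorted(samples, key=lambda s: s[1])
        n = len(ranked)
        for i, (text, _score) in enumerate(ranked):
            # Normalized position in (0, 1): where this sample falls in training.
            keyed.append(((i + 0.5) / n, name, text))
    keyed.sort(key=lambda k: (k[0], k[1]))
    return [(name, text) for _pos, name, text in keyed]


if __name__ == "__main__":
    web = [("w1", 0.1), ("w2", 0.9), ("w3", 0.5), ("w4", 0.3)]
    code = [("c1", 0.8), ("c2", 0.2)]
    # Repeat the top 25% of web samples twice, then build the schedule.
    schedule = curriculum_order({"web": repeat_top_quantile(web, 0.25, 2),
                                 "code": code})
    for domain, text in schedule:
        print(domain, text)
```

Sorting by normalized rank rather than raw score avoids comparing quality scores across domains, which may not be on a common scale. Whether the released training recipe handles this the same way is not stated in the abstract.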
Taxonomy
Topics: Machine Learning in Materials Science · Natural Language Processing Techniques · Topic Modeling
