PCMind-2.1-Kaiyuan-2B Technical Report

Kairong Luo; Zhenbo Sun; Xinyu Shi; Shengqi Chen; Bowen Yu; Yunyi Chen; Chenyi Dang; Hengtao Tao; Hui Wang; Fangming Liu; Kaifeng Lyu; Wenguang Chen

arXiv:2512.07612·cs.CL·December 9, 2025

TL;DR

This paper introduces PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter LLM that targets more efficient and effective pretraining under resource constraints through three methods: Quantile Data Benchmarking, Strategic Selective Repetition, and Multi-Domain Curriculum Training.

Contribution

The paper presents methods for data benchmarking, sample selection, and curriculum design aimed at resource-limited LLM pretraining, and releases the resulting models and tools as open source.

Findings

01. Competitive performance with state-of-the-art open-source models

02. Effective data mixing and training strategies for resource-limited settings

03. Open-source release facilitates broader research and application

Abstract

The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and…

Code & Models

Models

thu-pacman/PCMind-2.1-Kaiyuan-2B (Hugging Face model; 14 downloads, 29 likes)

Taxonomy

Topics: Machine Learning in Materials Science · Natural Language Processing Techniques · Topic Modeling