daVinci-LLM: Towards the Science of Pretraining

Yiwei Qin; Yixiu Liu; Tiantian Mi; Muhang Xie; Zhen Huang; Weiye Si; Pengrui Lu; Siyuan Feng; Xia Wu; Liming Liu; Ye Luo; Jinlong Hou; Qipeng Guo; Yu Qiao; Pengfei Liu

arXiv:2603.27164 · cs.AI · March 31, 2026

TL;DR

This paper advances the science of pretraining large language models by systematically exploring data processing, curriculum strategies, and evaluation protocols using an open, transparent approach with a 3B-parameter model trained on 8T tokens.

Contribution

It introduces a fully-open methodology for pretraining research, including detailed data processing pipelines, systematic ablations, and a new framework for understanding data influence.

Findings

1. Processing depth significantly improves model capabilities.
2. Different domains require adaptive data strategies.
3. Balanced compositional data prevents performance collapse.

Abstract

The foundational pretraining phase determines a model's capability ceiling, as post-training struggles to overcome capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully-open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks systematic methodology for data processing, we employ…

Code & Models

Repositories

gair-nlp/daVinci-LLM (GitHub)

Models

SII-GAIR-NLP/davinci-llm-model
model · 58 downloads · 29 likes

Datasets

SII-GAIR-NLP/davinci-llm-data
dataset · 989 downloads