daVinci-Dev: Agent-native Mid-training for Software Engineering

Ji Zeng; Dayuan Fu; Tiantian Mi; Yumin Zhuang; Yaxing Huang; Xuefeng Li; Lyumanshan Ye; Muhang Xie; Qishuo Hua; Zhen Huang; Mohan Jiang; Hanning Wang; Jifan Lin; Yang Xiao; Jie Sun; Yunze Wu; Pengfei Liu

arXiv:2601.18418 · cs.SE · January 28, 2026

TL;DR

This paper introduces daVinci-Dev, an agent-native mid-training approach for large language models in software engineering that focuses on data synthesis and training methods to improve autonomous code development.

Contribution

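The paper presents a systematic methodology for agentic mid-training on agent-native data. It addresses the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development, and it reports better performance than prior recipes while using less data.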

Findings

01. Achieved a 58.5% resolution rate with a 72B model.

02. Outperformed Kimi-Dev on software engineering tasks.

03. Used fewer than half the tokens of previous methods.

Abstract

Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering, a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, agentic mid-training, that is, mid-training (MT) on large-scale data that mirrors authentic agentic workflows, remains critically underexplored due to substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing…

Code & Models

Datasets

GAIR/daVinci-Dev (dataset, 2.5k downloads)

Taxonomy

Topics: Scientific Computing and Data Management · Software Engineering Research · Machine Learning and Data Classification