To Code, or Not To Code? Exploring Impact of Code in Pre-training
Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

TL;DR
Including code in pre-training data significantly enhances large language models' performance across diverse natural language and reasoning tasks, not just coding, with quality improvements further boosting results.
Contribution
This study systematically analyzes the impact of code data in pre-training on various downstream tasks, revealing its critical role beyond code generation.
Findings
Compared with text-only pre-training, adding code improves natural language reasoning by up to 8.2% (relative).
Code data improves performance on world knowledge tasks by up to 4.2%.
Adding code increases generative win-rates by 6.6% and boosts code task performance 12-fold.
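The findings above mix percentage gains with a fold-change. A minimal sketch of how such figures are typically computed, assuming the percentages are relative to a baseline pre-trained without code. The function names and the scores below are hypothetical illustrations, not values or code from the paper.

```python
def relative_gain_pct(baseline: float, treated: float) -> float:
    """Percentage change of `treated` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (treated - baseline) / baseline


def fold_change(baseline: float, treated: float) -> float:
    """How many times larger `treated` is than `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return treated / baseline


# Hypothetical benchmark scores, chosen only to illustrate the arithmetic.
text_only_score = 50.0   # assumed baseline: model pre-trained without code
with_code_score = 54.1   # assumed score: model pre-trained with code

print(round(relative_gain_pct(text_only_score, with_code_score), 1))  # 8.2
print(fold_change(1.0, 12.0))  # 12.0
```

A relative gain of 8.2% on a baseline of 50.0 is an absolute gain of only 4.1 points, which is why it helps to know which convention a reported number uses.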
Abstract
Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in general LLMs' performance, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for…
Code & Models
- 🤗 macadeliccc/magistrate-3.2-3b-base (model) · 9 downloads · ♡ 1
- 🤗 macadeliccc/magistrate-3.2-3b-it (model) · 8 downloads
- 🤗 macadeliccc/magistrate-3.2-3b-it-GGUF (model) · 105 downloads · ♡ 1
- 🤗 RichardErkhov/macadeliccc_-_magistrate-3.2-3b-it-gguf (model) · 282 downloads
- 🤗 RichardErkhov/macadeliccc_-_magistrate-3.2-3b-base-4bits (model)
- 🤗 RichardErkhov/macadeliccc_-_magistrate-3.2-3b-base-8bits (model) · 1 download
Taxonomy
Topics: Second Language Acquisition and Learning · EFL/ESL Teaching and Learning
