To Code, or Not To Code? Exploring Impact of Code in Pre-training

Viraat Aryabumi; Yixuan Su; Raymond Ma; Adrien Morisot; Ivan Zhang; Acyr Locatelli; Marzieh Fadaee; Ahmet Üstün; Sara Hooker

arXiv:2408.10914 · cs.CL · August 21, 2024 · 3 cites

TL;DR

Including code in pre-training data improves large language models' performance on natural language reasoning, world knowledge, and generative tasks, not just coding; higher-quality code data boosts results further.

Contribution

This study systematically analyzes the impact of code data in pre-training on various downstream tasks, revealing its critical role beyond code generation.

Findings

01. Code inclusion improves natural language reasoning by up to 8.2%.

02. Code data enhances world knowledge task performance by 4.2%.

03. Adding code increases generative win-rates by 6.6% and boosts code task performance 12-fold.

Abstract

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in the performance of general LLMs, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for…

Taxonomy

Topics: Second Language Acquisition and Learning · EFL/ESL Teaching and Learning