Textbooks Are All You Need

Suriya Gunasekar; Yi Zhang; Jyoti Aneja; Caio César Teodoro Mendes; Allie Del Giorno; Sivakanth Gopi; Mojan Javaheripi; Piero Kauffmann; Gustavo de Rosa; Olli Saarikivi; Adil Salim; Shital Shah; Harkirat Singh Behl; Xin Wang; Sébastien Bubeck; Ronen Eldan; Adam Tauman Kalai; Yin Tat Lee; Yuanzhi Li

arXiv:2306.11644·cs.CL·October 3, 2023·98 cites

TL;DR

The paper presents phi-1, a compact large language model for code trained on high-quality data, achieving competitive performance with significantly fewer parameters and training resources, demonstrating emergent capabilities.

Contribution

Introduction of phi-1, a small-scale yet effective code-focused language model trained on textbook-quality data, with notable performance and emergent properties.

Findings

01 · phi-1 achieves 50.6% pass@1 on HumanEval

02 · phi-1 attains 55.5% pass@1 on MBPP

03 · phi-1-small, a 350M-parameter model trained with the same pipeline as phi-1, still achieves 45% on HumanEval
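The findings above are reported as pass@1. As a reading aid, here is a minimal sketch of the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021). This page does not say how many samples per problem were drawn for phi-1, so the function names and the toy results below are illustrative assumptions, not the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples were drawn, c of them passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset contains no pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: four problems, ten samples each (not phi-1 results).
toy_results = [(10, 5), (10, 0), (10, 10), (10, 2)]
print(f"pass@1 = {benchmark_pass_at_k(toy_results, k=1):.3f}")
```

For k = 1 the estimator reduces to c / n, the fraction of sampled completions that pass the unit tests, averaged over problems.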

Abstract

We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
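To put the abstract's scale figures in perspective, the following back-of-the-envelope sketch converts them into GPU-hours and a data mix. It assumes all 8 GPUs ran for the full 4 days, which the abstract does not state precisely; the variable names are illustrative.

```python
# Figures taken from the abstract.
NUM_GPUS = 8            # A100s
TRAINING_DAYS = 4
WEB_TOKENS = 6e9        # filtered "textbook quality" web data
SYNTHETIC_TOKENS = 1e9  # GPT-3.5-generated textbooks and exercises

# Assumption: every GPU was busy for the full wall-clock duration.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

unique_tokens = WEB_TOKENS + SYNTHETIC_TOKENS
synthetic_share = SYNTHETIC_TOKENS / unique_tokens

print(f"approx. GPU-hours: {gpu_hours}")
print(f"unique training tokens: {unique_tokens:.0e}")
print(f"synthetic share of the data: {synthetic_share:.1%}")
```

Under that assumption the run comes to roughly 768 A100-hours over about 7B unique tokens, of which about one seventh is synthetic.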

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

+ The paper achieves state-of-the-art results on code generation benchmarks with a much smaller model trained on far less data. This re-emphasizes the power of high quality, tailored training data.
+ The model requires less compute resources for training compared to larger models.
+ The paper also shows rigorous evaluation of potential training set contamination.

Weaknesses

- The paper has low novelty, and the importance of data quality in LLMs is well known.
- The model is evaluated mainly on short Python functions. How would it perform on more complex, real-world coding tasks?

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 5

Strengths

This paper proposes a relatively efficient approach to training LLMs. It is impressive to get such results from a model trained on 8 A100s in 4 days. This makes the work accessible to academic research labs and takes a step towards reversing the trend of pursuing ever larger datasets with larger models and computational requirements. This also has implications for energy efficiency and sustainability. The paper itself is well-written and quite clear. The authors have committed to

Weaknesses

Key details on the data generation process are not shared for "proprietary reasons", yet this process is central to the paper, its proposition and its results. The lack of a broader impacts section weakens this paper. If this weakness is addressed, I am happy to improve my score.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

This paper adds to the expanding body of work that underscores the benefits of using quality data, and broadens the empirical evidence to the field of LMs for code. The results presented in the paper are noteworthy: with a comparatively small model, the authors manage to surpass much larger models that were trained on more tokens. Another contribution of the paper is the use of existing LLMs to produce quality synthetic data at scale for pretraining the smaller model.

Weaknesses

The paper posits the claim that meticulously curated data, combined with generating synthetic training data, can train smaller models that surpass larger ones. However, there are significant gaps regarding the creation of the training data. Specifically, the authors deliberately omit certain details, pointing to other papers that have taken a similar approach (as mentioned in footnote 1 on page 2). While the decision to withhold such details lies within a researcher's discretion, assessing the p

Code & Models

Videos

9 New Gemini Leaks, Code Llama and A Major AI Consciousness Paper · YouTube

Phi-1: A 'Textbook' Model · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Layer Normalization · Adam · Weight Decay