Textbooks Are All You Need
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan

TL;DR
The paper presents phi-1, a compact large language model for code trained on high-quality data, achieving competitive performance with significantly fewer parameters and training resources, demonstrating emergent capabilities.
Contribution
Introduction of phi-1, a small-scale yet effective code-focused language model trained on textbook-quality data, with notable performance and emergent properties.
Findings
phi-1 achieves 50.6% pass@1 on HumanEval
phi-1 attains 55.5% on MBPP
phi-1-small, a 350M-parameter model trained with the same pipeline, still achieves 45% on HumanEval
Abstract
We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
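The pass@1 figures above follow the pass@k convention used for HumanEval-style benchmarks. The abstract does not say how the number was computed for phi-1, so the following is only a minimal sketch of the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021), assuming n samples are drawn per problem and c of them pass the unit tests. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and do not come from the phi-1 paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: number of samples the metric is allowed to "look at"
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so a benchmark-level pass@1 is simply the mean fraction of passing samples across problems.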
Peer Reviews
Decision · Submitted to ICLR 2024
+ The paper achieves state-of-the-art results on code generation benchmarks with a much smaller model trained on far less data. This re-emphasizes the power of high-quality, tailored training data.
+ The model requires fewer compute resources for training than larger models.
+ The paper also shows rigorous evaluation of potential training set contamination.
- The paper has low novelty, and the importance of data quality in LLMs is well known.
- The paper is evaluated mainly on short Python functions. How would it perform on more complex, real-world coding tasks?
This paper proposes a relatively high-efficiency approach to training LLMs. It is impressive to get such results from a model trained on 8 A100s in 4 days. This makes the work approachable for academic research labs and takes a step towards reversing the trend of pursuing ever larger datasets with larger models and computational requirements. This also has implications for energy efficiency and sustainability. The paper itself is well written and quite clear. The authors have committed to
Key details on the data generation process are not shared for "proprietary reasons", yet this is central to the paper, its proposition and its results. The lack of a broader impacts section weakens this paper. If this weakness is addressed, I am happy to improve my score.
This paper adds to the expanding body of work that underscores the benefits of using quality data, and extends the empirical evidence to the field of LMs for code. The results presented in the paper are noteworthy: with a comparatively small model, the authors manage to surpass much larger models that were trained on more tokens. Another contribution of the paper is the use of existing LLMs to produce quality synthetic data at scale for pretraining the smaller model.
The paper posits the claim that meticulously curated data, combined with generating synthetic training data, can train smaller models that surpass larger ones. However, there are significant gaps regarding the creation of the training data. Specifically, the authors deliberately omit certain details, pointing to other papers that have taken a similar approach (as mentioned in footnote 1 on page 2). While the decision to withhold such details lies within a researcher's discretion, assessing the p
Code & Models
- 🤗 microsoft/phi-1 · model · 5.8k dl · ♡ 218
- 🤗 michaelfeil/ct2fast-phi-1 · model · 13 dl
- 🤗 OpenNMT/phi-1-ct2-int8 · model · 13 dl
- 🤗 TommyZQ/phi-1 · model
- 🤗 RichardErkhov/microsoft_-_phi-1-4bits · model · 3 dl
- 🤗 RichardErkhov/microsoft_-_phi-1-8bits · model · 5 dl
- 🤗 RichardErkhov/microsoft_-_phi-1-gguf · model · 83 dl
- 🤗 professorf/phi-1-gguf · model · 30 dl · ♡ 1
- 🤗 kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2 · model · 218 dl · ♡ 28
Videos
9 New Gemini Leaks, Code Llama and A Major AI Consciousness Paper· youtube
Phi-1: A 'Textbook' Model· youtube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Layer Normalization · Adam · Weight Decay
