Brevity is the soul of wit: Pruning long files for code generation
Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos

TL;DR
This paper demonstrates that pruning long code files is an effective and simple data filtering heuristic for fine-tuning large language models for code generation, improving efficiency and performance.
Contribution
It introduces a simple heuristic for data curation in code generation, pruning long files, and shows that it outperforms embedding-based filtering in compute-limited regimes.
Findings
Pruning long files improves training efficiency by up to 2x while matching performance.
Pruning long files yields a 3.5% absolute performance boost on HumanEval.
Embedding-based methods are confounded by file length.
Abstract
Data curation is commonly considered a "secret-sauce" for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internet-scale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic--pruning long files--outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute…
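As a rough illustration of the length heuristic described in the abstract, the sketch below drops the longest files from a small in-memory corpus. It is not the authors' implementation: this summary does not state how length is measured or where the cutoff sits, so the function name `prune_long_files`, the `keep_fraction` parameter, and the choice of character count as the length measure are all assumptions made for the example.

```python
from typing import Dict


def prune_long_files(files: Dict[str, str], keep_fraction: float = 0.5) -> Dict[str, str]:
    """Keep the shortest `keep_fraction` of files.

    Assumptions (not taken from the paper): length is measured in
    characters, and the cutoff is a fraction of the corpus rather
    than a fixed length threshold.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Rank paths from shortest to longest file; ties are broken by path
    # so the result is deterministic.
    ranked = sorted(files, key=lambda path: (len(files[path]), path))

    # Keep at least one file whenever the corpus is non-empty.
    n_keep = max(1, int(len(ranked) * keep_fraction)) if ranked else 0
    kept = set(ranked[:n_keep])

    # Return the surviving files in their original insertion order.
    return {path: text for path, text in files.items() if path in kept}


corpus = {
    "a.py": "x" * 10,
    "b.py": "x" * 1000,
    "c.py": "x" * 50,
    "d.py": "x" * 5000,
}
pruned = prune_long_files(corpus, keep_fraction=0.5)
```

With `keep_fraction=0.5`, the two shortest files (`a.py` and `c.py`) survive. In a real fine-tuning pipeline the length would more plausibly be a token count from the model's tokenizer, and the cutoff would be tuned against the available compute budget.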
Taxonomy
Topics: Teaching and Learning Programming · Embedded Systems Design Techniques · Model-Driven Software Engineering Techniques
Methods: Pruning
