Learning Performance-Improving Code Edits

Alexander Shypula; Aman Madaan; Yimeng Zeng; Uri Alon; Jacob Gardner; Milad Hashemi; Graham Neubig; Parthasarathy Ranganathan; Osbert Bastani; Amir Yazdanbakhsh

arXiv:2302.07867 · cs.SE · April 29, 2024 · 24 cites

TL;DR

This paper presents a framework that uses large language models and a curated dataset of human-written edits to perform high-level performance optimization of C++ programs, reporting speedups that exceed those achieved by human programmers.

Contribution

The paper introduces a dataset, an evaluation environment, and adaptation strategies that enable LLMs to perform high-level code optimizations, with reported speedups surpassing those of human programmers.

Findings

01

Achieved a mean speedup of 6.86× with eight generations.

02

Set a new upper limit on speedup of 9.64×, surpassing the best human submissions.

03

Developed an evaluation environment based on gem5 for reliable performance measurement.
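As a rough illustration of how a figure like the mean speedup above could be aggregated, the sketch below takes the best correct generation per program and, following the convention Reviewer 01 describes, counts a program as speedup 1 when every generation is incorrect or slower than the original. Treating "eight generations" as best-of-k selection is an assumption here, and the function names, the `(runtime, is_correct)` tuple layout, and the sample numbers are hypothetical rather than taken from the paper.

```python
from statistics import mean


def best_of_k_speedup(baseline_time, generations):
    """Best speedup over k generations for one program.

    generations: list of (runtime, is_correct) pairs (hypothetical layout).
    Incorrect or slower generations never lower the result below 1.0.
    """
    best = 1.0  # fallback: keep the original program
    for runtime, is_correct in generations:
        if is_correct and runtime > 0:
            best = max(best, baseline_time / runtime)
    return best


def mean_speedup(problems):
    """Average the per-program best-of-k speedups."""
    return mean(best_of_k_speedup(base, gens) for base, gens in problems)


# Made-up measurements: (baseline runtime, [(generated runtime, correct?), ...])
problems = [
    (10.0, [(2.0, True), (1.0, False)]),  # fastest output is wrong, so 5.0
    (4.0, [(8.0, True)]),                 # only a slowdown, so 1.0
    (6.0, [(3.0, True), (1.5, True)]),    # best of two correct outputs, 4.0
]

print(round(mean_speedup(problems), 2))  # 3.33
```

Clamping at 1 keeps failures from dragging the average below the baseline, which is also why a mean speedup alone says little about how often a model produces incorrect code.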

Abstract

With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements." To isolate and reliably evaluate the impact of program optimizations, we design an…
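To make the measurement problem concrete, the following sketch times the same unchanged program twice under random timing noise and counts how often the second run looks at least 10% faster. The uniform noise model, the 10% threshold, and the function name are illustrative assumptions; this is not the paper's measurement setup and does not model gem5.

```python
import random


def spurious_improvement_rate(noise_fraction, trials=1000, threshold=1.10, seed=0):
    """Fraction of trials in which an unchanged program appears faster.

    noise_fraction: maximum relative timing noise (assumed uniform).
    threshold: apparent speedup needed to count as an "improvement".
    """
    rng = random.Random(seed)
    true_runtime = 1.0  # the program is identical before and after
    spurious = 0
    for _ in range(trials):
        before = true_runtime * (1 + rng.uniform(-noise_fraction, noise_fraction))
        after = true_runtime * (1 + rng.uniform(-noise_fraction, noise_fraction))
        if before / after >= threshold:
            spurious += 1
    return spurious / trials


# A deterministic measurement never reports a speedup for an unchanged program.
print(spurious_improvement_rate(noise_fraction=0.0))
# With 15% timing noise, some identical programs look like optimizations.
print(spurious_improvement_rate(noise_fraction=0.15))
```

Under this toy model, any nonzero noise level large enough to cross the threshold produces false positives, which is the motivation for evaluating edits in a deterministic environment.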

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 8 · accept, good paper · Confidence 3

Strengths

+ The problem of performance editing is important. Not only should the generated code be correct, it should also be as efficient as possible. Therefore, the paper targets an important problem.
+ The range of adaptations considered is wide, and the experimental results give the reader a clear picture of how each adaptation strategy can be used to improve the results.
+ The paper uses top closed-source (GPT-3.5 and GPT-4) and open-source (Code Llama 7B, 13B, 34B) models to study the impact of PIE datas

Weaknesses

- The study mostly focuses on performance percentages, and in cases where the generated program is slower or incorrect, the speedup is counted as 1. This is not a good approach. In particular, a study is needed of the categories of programs for which LLMs often fail to produce correct or optimized code. Such a study would make it easier to infer when it is better not to use LLMs, or which categories of programs need further improvement.
- For syntheti

Reviewer 02 · Rating 8 · accept, good paper · Confidence 3

Strengths

The benchmark is an important artifact that the community will continue to build upon, especially as code-generating/editing large language models continue to be developed and deployed in research and production environments. The analysis and ablations are very thorough and further justify the benchmark and prompting strategies as important contributions.

Weaknesses

The experiments are very thorough, but it seems that the correctness of the models degrades with the introduced methods. While the paper's primary contribution appears to be the dataset, further analysis of correctness (as opposed to pure speedup and optimization percentages) would further solidify the adaptation-methods sections of the paper.

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

This work is well motivated and addresses a meaningful task. The construction process described in Section 2 seems reasonable. The different methods for adapting models in Section 3 cover most mainstream methods.

Weaknesses

The results are too good to be true. It's surprising to see that, for C++ competitive programming tasks, the fine-tuned GPT-3.5 beats the best human submission by a large margin: 6.86X speed up versus 4.06X speed up. So let's take a look at the examples in appendix A.1 which are code improvements generated by the model. Figure 3 of A.1 contains two programs that are functionally different. Figure 4(a) in A.1 is so bad that it seems unlikely in a C++ competition. Figure 5 contains two programs th

Code & Models

Videos

Learning Performance-Improving Code Edits · SlidesLive

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Dense Connections · Adam · Layer Normalization · Residual Connection