Evaluating Language Models for Efficient Code Generation
Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang

TL;DR
This paper introduces Differential Performance Evaluation (DPE), a new framework for reliably assessing the efficiency of Large Language Models in code generation, addressing limitations of existing benchmarks.
Contribution
The paper presents DPE, a novel evaluation framework that uses efficiency-demanding tasks and compound metrics to better measure LLM code efficiency, along with the EvalPerf benchmark.
Findings
Model size has limited impact on code efficiency.
Instruction tuning improves both correctness and efficiency.
EvalPerf is a reliable and platform-independent benchmark.
Abstract
We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code efficiency, due to their reliance on simplistic test inputs and the absence of effective compound metrics. DPE addresses these issues by focusing on efficiency-demanding programming tasks and establishing an insightful compound metric for performance evaluation. DPE operates in two phases: To curate efficiency datasets, it selects efficiency-demanding tasks from existing coding benchmarks and generates computationally expensive inputs to stress the efficiency of LLM solutions. To assess the code efficiency, DPE profiles the new solution and compares it globally against a set of reference solutions that exhibit distinct efficiency levels, where the…
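The comparison step described in the abstract can be illustrated with a minimal Python sketch. The abstract is cut off before it states how the comparison becomes a score, so everything below is an assumption made for illustration: the function names `match_level` and `hypothetical_score`, the use of a single abstract cost number per solution, the nearest-cost matching rule, and the percentage-style score are not taken from the paper.

```python
def match_level(reference_costs, new_cost):
    """Return the index of the reference level whose cost is closest to new_cost.

    reference_costs: one measured cost per reference solution, ordered from
    least efficient (index 0) to most efficient (last index). Assumed layout.
    """
    return min(
        range(len(reference_costs)),
        key=lambda i: abs(reference_costs[i] - new_cost),
    )


def hypothetical_score(reference_costs, new_cost):
    """Map the matched level to a 0-100 value (an assumed scoring rule)."""
    level = match_level(reference_costs, new_cost)
    return 100.0 * (level + 1) / len(reference_costs)


# Made-up costs for three reference solutions with distinct efficiency levels,
# all measured on the same computationally expensive input.
reference_costs = [1_000_000, 50_000, 2_000]

# A new solution costing 45,000 units is closest to the middle reference.
print(match_level(reference_costs, 45_000))
print(hypothetical_score(reference_costs, 1_500))
```

Matching against several references with distinct efficiency levels, instead of reporting a raw runtime, is what lets a result be read relative to known solutions; the specific matching and scoring rules here are placeholders.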
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Natural Language Processing Techniques · Software Testing and Debugging Techniques
Methods: Sparse Evolutionary Training
