Mercury: A Code Efficiency Benchmark for Code Large Language Models

Mingzhe Du; Anh Tuan Luu; Bin Ji; Qian Liu; See-Kiong Ng

arXiv:2402.07844·cs.SE·June 12, 2024·2 cites

TL;DR

Mercury is a code efficiency benchmark for Code LLMs that evaluates functional correctness and runtime performance together. It reveals efficiency gaps in current models and highlights DPO as a promising optimization method.

Contribution

The paper presents Mercury, the first benchmark focused on code efficiency for Code LLMs, and introduces the Beyond metric, which evaluates correctness and efficiency together.

Findings

01. Leading Code LLMs reach 65% on Pass but less than 50% on Beyond.

02. Direct Preference Optimization (DPO) outperforms Supervised Fine-Tuning (SFT) in improving code efficiency.

03. Mercury's data and code are publicly available.

Abstract

Amidst the recent strides in evaluating Large Language Models for Code (Code LLMs), existing benchmarks have mainly focused on the functional correctness of generated code, neglecting the importance of their computational efficiency. To fill the gap, we present Mercury, the first code efficiency benchmark for Code LLMs. It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines, enabling a comprehensive analysis of the runtime distribution. Based on the distribution, we introduce a new metric Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously. On Mercury, leading Code LLMs can achieve 65% on Pass, while less than 50% on Beyond. Given that an ideal Beyond score would be aligned with the Pass score, it indicates that while Code LLMs exhibit…
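To make the relationship between Pass and Beyond concrete, here is a minimal sketch of a runtime-percentile-weighted Pass score. The exact formula used by Mercury is not given in the text above, so the weighting below (the fraction of a task's baseline solutions that run slower than the generated solution) is an assumption, and `TaskResult`, `runtime_percentile`, `pass_score`, and `beyond_score` are hypothetical names chosen for illustration. The runtimes are made-up toy values, not benchmark results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TaskResult:
    # Hypothetical record; field names are illustrative, not taken from Mercury's code.
    passed: bool                        # did the generated solution pass all tests?
    runtime: Optional[float]            # seconds; None when the solution failed
    baseline_runtimes: Sequence[float]  # runtimes of the task's baseline solutions


def runtime_percentile(runtime: float, baseline_runtimes: Sequence[float]) -> float:
    """Fraction of baseline solutions that are strictly slower than `runtime`.

    Assumption: this stands in for the runtime percentile mentioned in the abstract.
    """
    slower = sum(1 for b in baseline_runtimes if b > runtime)
    return slower / len(baseline_runtimes)


def pass_score(results: Sequence[TaskResult]) -> float:
    """Plain Pass: the share of tasks solved correctly."""
    return sum(r.passed for r in results) / len(results)


def beyond_score(results: Sequence[TaskResult]) -> float:
    """Pass weighted by runtime percentile; failed tasks contribute 0."""
    total = 0.0
    for r in results:
        if r.passed and r.runtime is not None:
            total += runtime_percentile(r.runtime, r.baseline_runtimes)
    return total / len(results)


# Toy data: three tasks sharing the same four baseline runtimes.
baselines = [0.20, 0.30, 0.40, 0.50]
results = [
    TaskResult(True, 0.10, baselines),   # faster than all baselines, weight 1.0
    TaskResult(True, 0.35, baselines),   # faster than two of four, weight 0.5
    TaskResult(False, None, baselines),  # failed, weight 0.0
]

print(f"Pass:   {pass_score(results):.3f}")
print(f"Beyond: {beyond_score(results):.3f}")
```

Under this sketch, Beyond can never exceed Pass, and the two coincide only when every passing solution is faster than all of its baselines. That is the sense in which an ideal Beyond score would be aligned with the Pass score.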

Code & Models

Repositories

elfsong/mercury
PyTorch · Official

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Particle Detector Development and Performance