Mercury: A Code Efficiency Benchmark for Code Large Language Models
Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, See-Kiong Ng

TL;DR
Mercury is a code efficiency benchmark for Code LLMs that scores functional correctness and runtime performance together. It exposes an efficiency gap in current models and points to Direct Preference Optimization (DPO) as a promising way to narrow it.
Contribution
The paper presents Mercury, which the authors describe as the first code efficiency benchmark for Code LLMs, and introduces Beyond, a runtime-percentile-weighted Pass score that reflects correctness and efficiency together.
Findings
Leading Code LLMs score 65% on Pass but less than 50% on Beyond.
DPO improves code efficiency more than supervised fine-tuning (SFT) does.
Mercury's data and code are publicly available.
Abstract
Amidst the recent strides in evaluating Large Language Models for Code (Code LLMs), existing benchmarks have mainly focused on the functional correctness of generated code, neglecting the importance of their computational efficiency. To fill the gap, we present Mercury, the first code efficiency benchmark for Code LLMs. It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines, enabling a comprehensive analysis of the runtime distribution. Based on the distribution, we introduce a new metric Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously. On Mercury, leading Code LLMs can achieve 65% on Pass, while less than 50% on Beyond. Given that an ideal Beyond score would be aligned with the Pass score, it indicates that while Code LLMs exhibit…
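The abstract describes Beyond only as a runtime-percentile-weighted Pass score, so the following is a minimal sketch of one way such a metric could be computed, not the paper's exact definition. It assumes that a failing solution scores 0, that a passing solution scores the fraction of baseline solutions it strictly outruns, and that both Pass and Beyond are averaged over tasks. The names `runtime_percentile` and `pass_and_beyond` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def runtime_percentile(runtime: float, baseline_runtimes: Sequence[float]) -> float:
    """Fraction of baseline solutions that are strictly slower than `runtime`.

    Assumption: this stands in for the paper's percentile weighting.
    """
    if not baseline_runtimes:
        raise ValueError("each task needs at least one baseline runtime")
    ordered = sorted(baseline_runtimes)
    # bisect_right counts baselines with runtime <= ours; the rest are slower.
    slower = len(ordered) - bisect_right(ordered, runtime)
    return slower / len(ordered)


def pass_and_beyond(
    results: List[Tuple[Optional[float], Sequence[float]]]
) -> Tuple[float, float]:
    """Return (Pass, Beyond) averaged over tasks.

    Each entry is (runtime, baseline_runtimes); runtime is None when the
    generated solution failed the task's tests (assumed encoding).
    """
    if not results:
        raise ValueError("no tasks to score")
    passed = 0
    weighted = 0.0
    for runtime, baselines in results:
        if runtime is None:
            continue  # a failed task contributes 0 to both scores
        passed += 1
        weighted += runtime_percentile(runtime, baselines)
    return passed / len(results), weighted / len(results)


if __name__ == "__main__":
    demo = [
        (1.0, [0.5, 2.0, 3.0, 4.0]),  # passes, faster than 3 of 4 baselines
        (None, [1.0, 2.0]),           # fails the tests
        (5.0, [1.0, 2.0]),            # passes, slower than every baseline
    ]
    print(pass_and_beyond(demo))
```

Under these assumptions Beyond can never exceed Pass, and the two coincide only when every passing solution outruns all of its baselines, which matches the abstract's remark that an ideal Beyond score would be aligned with the Pass score.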
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Particle Detector Development and Performance
