NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Shudan Zhang; Hanlin Zhao; Xiao Liu; Qinkai Zheng; Zehan Qi; Xiaotao Gu; Xiaohan Zhang; Yuxiao Dong; Jie Tang

arXiv:2405.04520·cs.CL·May 8, 2024·2 cites

TL;DR

NaturalCodeBench is a new challenging code benchmark derived from real user queries, revealing significant performance gaps in large language models and highlighting the need for more practical evaluation scenarios.

Contribution

The paper introduces NaturalCodeBench, a diverse and realistic code benchmark with a semi-automated test case generation pipeline, addressing limitations of existing benchmarks.

Findings

01. Models with similar HumanEval scores show large performance gaps on NCB.

02. Even GPT-4 performs poorly on the NCB benchmark.

03. NCB reveals the gap between current LLM capabilities and real-world coding requirements.

Abstract

Large language models (LLMs) have demonstrated a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks in algorithms and data science and do not adequately cover the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Comparing with manual…
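Benchmarks of this kind score a model by running its generated code against test cases. The following is a minimal sketch of that idea under stated assumptions, not the NCB pipeline or its data format: the `Problem` class, the `passes_all_tests` and `pass_at_1` helpers, and both toy problems are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Problem:
    """A toy benchmark item (hypothetical format, not NCB's)."""
    prompt: str
    entry_point: str                 # name of the function the candidate must define
    tests: list[tuple[tuple, Any]]   # (positional args, expected return value)


def passes_all_tests(candidate_source: str, problem: Problem) -> bool:
    """Run the candidate code and check it against every test case."""
    namespace: dict[str, Any] = {}
    try:
        # exec is unsafe for untrusted code; a real harness would sandbox this.
        exec(candidate_source, namespace)
        func = namespace[problem.entry_point]
        return all(func(*args) == expected for args, expected in problem.tests)
    except Exception:
        # Syntax errors, missing entry points, and runtime errors all count as failures.
        return False


def pass_at_1(problems: list[Problem], candidates: list[str]) -> float:
    """Fraction of problems whose single candidate passes all of its tests."""
    solved = sum(passes_all_tests(src, p) for p, src in zip(problems, candidates))
    return solved / len(problems)


# Two illustrative problems phrased like user requests (invented for this sketch).
PROBLEMS = [
    Problem(
        prompt="Count how often each word appears in a sentence.",
        entry_point="count_words",
        tests=[(("a b a",), {"a": 2, "b": 1}), (("",), {})],
    ),
    Problem(
        prompt="Remove duplicates from a list but keep the original order.",
        entry_point="dedupe",
        tests=[(([3, 1, 3, 2],), [3, 1, 2])],
    ),
]

# One correct candidate and one plausible-looking but wrong candidate.
CANDIDATES = [
    "def count_words(s):\n"
    "    counts = {}\n"
    "    for w in s.split():\n"
    "        counts[w] = counts.get(w, 0) + 1\n"
    "    return counts\n",
    "def dedupe(items):\n"
    "    return sorted(set(items))  # loses the original order\n",
]

if __name__ == "__main__":
    print(pass_at_1(PROBLEMS, CANDIDATES))
```

The second candidate fails only because the test checks ordering, which illustrates why test cases for natural queries need to encode the full requirement rather than a loose property.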

Code & Models

Repositories

thudm/naturalcodebench

Taxonomy

Topics: Context-Aware Activity Recognition Systems

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Softmax · Absolute Position Encodings