NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang

TL;DR
NaturalCodeBench is a new challenging code benchmark derived from real user queries, revealing significant performance gaps in large language models and highlighting the need for more practical evaluation scenarios.
Contribution
The paper introduces NaturalCodeBench, a diverse and realistic code benchmark with a semi-automated test case generation pipeline, addressing limitations of existing benchmarks.
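To make concrete what test-case-based evaluation of generated code involves, here is a minimal sketch in Python. It is not the paper's pipeline or harness: the names `passes_all`, `pass_rate`, and the toy `word_count` problem are hypothetical, and the scoring rule (a solution counts only if it passes every test) is an assumption made for illustration.

```python
from typing import Any, Callable, Iterable, Tuple

# Hypothetical representation: a test case is (positional args, expected result).
TestCase = Tuple[tuple, Any]


def passes_all(candidate: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True only if the candidate matches every expected output."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # Any runtime error is treated as a failed test.
            return False
    return True


def pass_rate(results: Iterable[bool]) -> float:
    """Fraction of candidates that passed all of their tests."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


# Toy problem for illustration only; it is not taken from NCB.
def word_count(text: str) -> int:
    return len(text.split())


def buggy_word_count(text: str) -> int:
    # Splitting on a single space miscounts empty and multi-space input.
    return len(text.split(" "))


toy_tests = [(("a b c",), 3), (("",), 0), (("  spaced   out ",), 2)]
outcomes = [
    passes_all(word_count, toy_tests),
    passes_all(buggy_word_count, toy_tests),
]
print(pass_rate(outcomes))
```

The edge-case inputs (empty string, repeated spaces) are what separate the two candidates, which hints at why writing good test cases for realistic queries takes effort.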
Findings
Models with similar HumanEval scores show large performance gaps on NaturalCodeBench (NCB).
Even GPT-4 performs poorly on NCB.
NCB reveals the gap between current LLM capabilities and real-world coding requirements.
Abstract
Large language models (LLMs) have manifested strong ability to generate codes for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithm and data science, insufficiently satisfying challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating testing cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Comparing with manual…
Code & Models
- 🤗 google/gemma-3-270m · model · 83k downloads · ♡ 1003
- 🤗 google/gemma-3-270m-it · model · 111k downloads · ♡ 569
- 🤗 unsloth/gemma-3-270m-it · model · 24k downloads · ♡ 23
- 🤗 unsloth/gemma-3-270m-it-GGUF · model · 69k downloads · ♡ 158
- 🤗 litert-community/gemma-3-270m-it · model · 2.1k downloads · ♡ 43
- 🤗 p-e-w/gemma-3-270m-it-heretic · model · 327 downloads · ♡ 13
- 🤗 google/gemma-3-270m-qat-q4_0-unquantized · model · 42 downloads · ♡ 8
- 🤗 onnx-community/gemma-3-270m-it-ONNX · model · 1.6k downloads · ♡ 26
- 🤗 google/gemma-3-270m-it-qat-q4_0-unquantized · model · 211 downloads · ♡ 12
- 🤗 unsloth/gemma-3-270m-it-unsloth-bnb-4bit · model · 10k downloads · ♡ 5
Taxonomy
Topics: Context-Aware Activity Recognition Systems
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Softmax · Absolute Position Encodings
