MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Jianbo Dai; Jianqiao Lu; Yunlong Feng; Guangtao Zeng; Rongju Ruan; Ming Cheng; Dong Huang; Haochen Tan; Zhijiang Guo

arXiv:2405.11430 · cs.CL · August 19, 2025 · 2 cites

TL;DR

This paper introduces MHPP, a new challenging dataset for evaluating language models' code generation abilities beyond existing benchmarks, revealing limitations of current models and improving understanding of their true capabilities.

Contribution

The paper presents MHPP, a curated dataset of hard Python problems that better assesses LLMs' reasoning and coding skills, addressing gaps in existing benchmarks.

Findings

01

Many high-performing models on HumanEval underperform on MHPP.

02

MHPP uncovers previously unknown limitations of LLMs.

03

Evaluation pipeline and leaderboard are publicly available.

Abstract

Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this calls into question the adequacy of existing benchmarks for thoroughly assessing function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that they might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To resolve this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial…
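
The abstract quotes a pass rate without saying how it is computed. As a hedged illustration, the sketch below implements the unbiased pass@k estimator that is commonly reported for HumanEval-style benchmarks. Whether the MHPP pipeline uses exactly this estimator is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed every test
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, any draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs (illustrative helper)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single greedy sample per problem (n = 1, k = 1), this reduces to the plain fraction of problems solved, which is the usual reading of a headline figure like the one above.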

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper raises a key question, whether current LLMs have mastered function-level code generation, and its detailed breakdown of 7 challenge types effectively motivates the need for this benchmark.

Weaknesses

- The benchmark includes 210 problems, which, while comparable to HumanEval’s 164, may be insufficient for broader generalizability. Note that recent benchmarks, like BigCodeBench [1], offer over 1K problems.
- Current code generation benchmarks, including this work, are vulnerable to future data contamination as the test set is often public. To mitigate this, splitting the benchmark into validation and hidden test sets, with evaluations on the test set limited to submissions, may be advisable, f

Reviewer 02 · Rating 5 · Confidence 5

Strengths

* **Benchmark and Error Analysis.** The authors analyze existing benchmarks MBPP and HumanEval, identifying mistakes and contamination. They also produce a manual categorization of mistakes made on HumanEval.
* **New benchmark.** The authors provide a manually curated benchmark of 210 problems focusing on mistakes identified on HumanEval. This provides confidence in the quality of the benchmark problem statements.
* **Qualitative Analysis.** The authors provide insights into model failures through

Weaknesses

* **Benchmark Size.** The benchmark consists of only 210 problems. This limited size calls the empirical findings into question. The concern is even more serious for the category-wise results, where the number of benchmark samples per category would be around 30-40 (Tables 2, 3 and Figure 4), potentially increasing noise levels to over 10-15% and making the results unreliable.
* **Confidence Intervals.** As I understand, section 5.1 computes the confid
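
The benchmark-size concern raised by Reviewer 02 can be made concrete with a short calculation. The sketch below is not taken from the paper or the review: it assumes each problem is an independent pass/fail trial and uses the normal-approximation 95% interval for a binomial proportion. The function name `ci_half_width` is hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def ci_half_width(pass_rate: float, n_problems: int, z: float = Z_95) -> float:
    """Half-width of a normal-approximation confidence interval for a pass rate.

    Assumes problems are independent Bernoulli trials (an assumption made
    for this illustration, not a claim about the paper's methodology).
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    return z * sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


if __name__ == "__main__":
    # Worst case for the interval width is a pass rate of 0.5.
    for n in (30, 40, 210):
        print(f"n={n:3d}  half-width={ci_half_width(0.5, n):.3f}")
```

Under these assumptions, a category with 30 to 40 problems carries a worst-case half-width of roughly 15 to 18 percentage points, against about 7 points for the full 210-problem set, which is consistent with the reviewer's estimate.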

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1. The benchmark is human-created and (for the moment) is unlikely to be a part of any pre-training corpora.
2. The authors show that the problems are challenging enough to leave some headroom, even for the SOTA models.

Weaknesses

Overall, I do not see the point of this benchmark in terms of bringing something to the field that is not already out there:

1. ~14 tests on average per sample makes it better than HumanEval and MBPP but is still far outmatched by benchmarks such as EvalPlus [1]
2. In terms of being a challenging test for CodeLMs, due to limitations in chosen domains, library usage and question difficulty, it is, on average, well outdone by existing benchmarks like BigCodeBench [2], ClassEval [3] and SWE-Bench

Code & Models

Repositories

sparksofagi/mhpp · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout