HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Zhaojian Yu; Yilun Zhao; Arman Cohan; Xiao-Ping Zhang

arXiv:2412.21199·cs.SE·January 3, 2025

TL;DR

This paper introduces self-invoking code generation, a new evaluation task that tests the reasoning of LLMs by having them solve a base problem and then use that solution to solve a related, more complex one. It also presents new benchmarks built for this task.

Contribution

It proposes a general method to create challenging benchmarks for self-invoking code generation and analyzes LLM performance, revealing limitations and failure modes.

Findings

01

LLMs perform well on traditional benchmarks but less so on self-invoking tasks.

02

On self-invoking tasks, instruction-tuned models show only marginal improvements over base models.

03

Performance drops significantly on self-invoking benchmarks compared to traditional ones.

Abstract

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks.…
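To make the task format concrete, here is a minimal sketch of what a base problem and its self-invoking counterpart could look like. The problem pair, the function names (`count_vowels`, `most_vowel_rich`), and the `passes_all` helper are hypothetical illustrations written for this page; they are not taken from HumanEval Pro, MBPP Pro, or BigCodeBench-Lite Pro, and the benchmarks' actual evaluation harness may differ.

```python
def count_vowels(text):
    """Base problem (hypothetical): count the vowels in one string."""
    return sum(1 for ch in text.lower() if ch in "aeiou")


def most_vowel_rich(words):
    """Self-invoking problem (hypothetical): return the word with the most vowels.

    The solution reuses the base solution instead of re-deriving it.
    Ties go to the earliest word, because max() keeps the first maximum.
    """
    if not words:
        raise ValueError("words must be non-empty")
    return max(words, key=count_vowels)


def passes_all(tests):
    """Return True only if every (function, args, expected) case passes."""
    return all(func(*args) == expected for func, args, expected in tests)


if __name__ == "__main__":
    # Assumed scoring rule for this sketch: both problems must be solved.
    tests = [
        (count_vowels, ("Benchmark",), 2),
        (most_vowel_rich, (["code", "evaluation", "pro"],), "evaluation"),
    ]
    print(passes_all(tests))
```

In this sketch a model that solves `count_vowels` but fails to compose it correctly inside `most_vowel_rich` would fail the pair, which is the kind of gap between traditional and self-invoking performance the abstract describes.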

Code & Models

Repositories

CodeEval-Pro/CodeEval-Pro (Official)

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling

Methods: Balanced Selection