HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

TL;DR
This paper introduces self-invoking code generation, a new evaluation task in which an LLM must solve a base problem and then use that solution to solve a related, more complex one. It also introduces three benchmarks for the task: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro.
Contribution
The paper proposes a general recipe for turning existing code generation benchmarks into harder, self-invoking versions, and analyzes how LLMs perform on them, revealing limitations and failure modes.
Findings
Most LLMs excel on traditional benchmarks such as HumanEval and MBPP, but their performance drops significantly on the self-invoking versions.
On self-invoking tasks, instruction-tuned models show only marginal improvements over base models.
Abstract
We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks.…
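To make the task format concrete, here is a minimal sketch of a base problem paired with a self-invoking problem. The problem pair, the function names is_palindrome and longest_palindrome_word, and the example inputs are hypothetical illustrations. They are not taken from HumanEval Pro, MBPP Pro, or BigCodeBench-Lite Pro, and they do not reproduce the authors' benchmark construction recipe.

```python
from typing import List


def is_palindrome(text: str) -> bool:
    """Base problem (hypothetical): True if text reads the same in both
    directions, ignoring case and non-alphanumeric characters."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]


def longest_palindrome_word(words: List[str]) -> str:
    """Self-invoking problem (hypothetical): return the longest palindrome
    in words, or an empty string if there is none.

    The solution reuses the base solution instead of re-deriving it."""
    palindromes = [word for word in words if is_palindrome(word)]
    # max() returns the first longest entry; default covers the empty case.
    return max(palindromes, key=len, default="")


if __name__ == "__main__":
    print(is_palindrome("Level"))
    print(longest_palindrome_word(["noon", "racecar", "python"]))
```

In this sketch, a model would be scored on the second function, whose correctness depends on both the base solution and the way it is invoked. An error in either place fails the harder problem.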
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling
Methods: Balanced Selection
