PyBench: Evaluating LLM Agent on various real-world coding tasks

Yaolun Zhang; Yinxu Pan; Yudong Wang; Jie Cai

arXiv:2407.16732 · cs.SE · August 6, 2024

TL;DR

PyBench is a comprehensive benchmark designed to evaluate large language model agents on diverse real-world Python coding tasks, highlighting current limitations and demonstrating the effectiveness of a fine-tuned 8B model.

Contribution

Introduces PyBench, a new benchmark covering various real-world coding tasks, and presents PyLlama3, a fine-tuned model that outperforms larger models on this benchmark.

Findings

01. Current open-source LLMs struggle with PyBench tasks.

02. PyLlama3 surpasses larger models in performance.

03. Comprehensive reasoning and feedback incorporation are essential.

Abstract

An LLM Agent equipped with a code interpreter can automatically solve real-world coding tasks, such as data analysis and image editing. However, existing benchmarks focus either on simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which is representative of everyday coding tasks. To address this gap, we introduce PyBench, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning…
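The interaction pattern described in the abstract (reason, run Python in an interpreter for a few turns, then give a formal response) can be illustrated with a minimal sketch. This is not the PyBench harness: `run_code`, `agent_loop`, and `scripted_policy` are hypothetical names, and the scripted policy stands in for a real LLM call.

```python
import contextlib
import io
import traceback


def run_code(code, namespace):
    """Execute code in a shared namespace; return captured stdout, plus the traceback on error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # Errors are returned as text so the agent can react to them on the next turn.
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def agent_loop(query, policy, max_turns=5):
    """Alternate between the policy and the interpreter until an answer or the turn limit."""
    namespace = {}  # persists across turns, like a notebook session
    history = [("user", query)]
    for _ in range(max_turns):
        kind, content = policy(history)
        if kind == "answer":
            return content, history
        observation = run_code(content, namespace)
        history.append(("code", content))
        history.append(("observation", observation))
    return None, history  # no formal response within the turn budget


def scripted_policy(history):
    """Hypothetical stand-in for an LLM: one code turn, then a final answer."""
    observations = [text for role, text in history if role == "observation"]
    if not observations:
        return "code", "total = sum([3, 4, 5])\nprint(total)"
    return "answer", "The total is " + observations[-1].strip()


answer, history = agent_loop("Add up 3, 4 and 5.", scripted_policy)
print(answer)
```

In a real setup the policy would call a model and the interpreter would run in a sandbox with the task's files mounted; both are left out here to keep the sketch self-contained.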

Code & Models

Repositories

mercury7353/pybench
Official

Models

Mercury7353/PyLlama3
model · 2 dl · ♡ 5

Datasets

Mercury7353/PyInstruct
dataset · 173 dl
