PyBench: Evaluating LLM Agent on various real-world coding tasks
Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai

TL;DR
PyBench is a comprehensive benchmark designed to evaluate large language model agents on diverse real-world Python coding tasks, highlighting current limitations and demonstrating the effectiveness of a fine-tuned 8B model.
Contribution
Introduces PyBench, a new benchmark covering a range of real-world coding tasks, and presents PyLlama3, a fine-tuned 8B model that outperforms larger models on this benchmark.
Findings
Current open-source LLMs struggle with PyBench tasks.
PyLlama3 surpasses larger models in performance.
Solving PyBench tasks requires comprehensive reasoning and the ability to incorporate feedback.
Abstract
The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which is representative of the variety of daily coding tasks. To address this gap, we introduce PyBench, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning…
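The interaction pattern the abstract describes (reason, execute Python through a code interpreter for a few turns, then give a formal response) can be illustrated with a small sketch. This is not PyBench's evaluation harness: `run_code`, `agent_loop`, `scripted_policy`, and the `max_turns` default are hypothetical names and values chosen for illustration, and the scripted policy merely stands in for a real LLM.

```python
import contextlib
import io
import traceback


def run_code(code, namespace):
    """Execute code in a shared namespace; return captured stdout plus any traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # Errors are returned as text so the agent can react to them next turn.
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def agent_loop(query, policy, max_turns=5):
    """Alternate between the policy and the interpreter until a final answer."""
    namespace = {}  # persists across turns, like an interpreter session
    history = [("user", query)]
    for _ in range(max_turns):
        kind, content = policy(history)
        if kind == "final":
            return content, history
        observation = run_code(content, namespace)
        history.append(("code", content))
        history.append(("observation", observation))
    return None, history  # turn budget exhausted without a final answer


def scripted_policy(history):
    """Hypothetical stand-in for an LLM: one code turn, then a final answer."""
    observations = [text for role, text in history if role == "observation"]
    if not observations:
        return "code", "values = [3, 1, 2]\nprint(sum(values) / len(values))"
    return "final", "The mean is " + observations[-1].strip()


answer, history = agent_loop("What is the mean of 3, 1 and 2?", scripted_policy)
print(answer)
```

Returning the traceback as an observation, rather than raising, is the design choice that lets a model see a failed execution and revise its code on the following turn.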
