ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Xiangru Tang; Yuliang Liu; Zefan Cai; Yanjun Shao; Junjie Lu; Yichi Zhang; Zexuan Deng; Helan Hu; Kaikai An; Ruijun Huang; Shuzheng Si; Sheng Chen; Haozhe Zhao; Liang Chen; Yan Wang; Tianyu Liu; Zhiwei Jiang; Baobao Chang; Yin Fang; Yujia Qin; Wangchunshu Zhou; Yilun Zhao; Arman Cohan; Mark Gerstein

arXiv:2311.09835 · cs.CL · August 22, 2024 · 1 citation

TL;DR

ML-Bench is a comprehensive benchmark designed to evaluate large language models and AI agents on real-world, repository-level coding tasks, highlighting current capabilities and areas for improvement in understanding complex code interactions.

Contribution

The paper introduces ML-Bench, a novel benchmark with real-world code examples for evaluating LLMs and agents on repository-scale code understanding and execution tasks.

Findings

01

GPT-4o achieves over 50% Pass@5 in code generation tasks.

02

GPT-4o attains a 76.47% success rate in autonomous agent tasks.

03

Significant challenges remain, including hallucinations and bash script generation issues.
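
The Pass@5 figure in the first finding can be made concrete with a small sketch. The page does not say how ML-Bench computes Pass@k, so the code below assumes the commonly used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass. The names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the ML-Bench repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of samples generated for the task
    c: number of those samples that passed
    k: sample budget being scored (e.g. 5 for Pass@5)

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); the source page does not confirm
    that ML-Bench uses exactly this formula.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-task outcomes: (samples generated, samples passed).
    demo = [(5, 0), (5, 1), (5, 3)]
    print(f"Pass@5 over demo tasks: {mean_pass_at_k(demo, 5):.4f}")
```

When n equals k, as in the demo, the estimator reduces to "did any of the k samples pass", so a benchmark-level Pass@5 above 50% means more than half of the tasks had at least one passing sample out of five.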

Abstract

Although Large Language Models (LLMs) like GPT-4 achieve impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), which requires a deeper comprehension of complex file interactions. More recently, LLM agents have been developed that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses 9,641 annotated examples across 18 GitHub repositories, challenging LLMs to accommodate…

Code & Models

Repositories

gersteinlab/ml-bench · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Ferroelectric and Negative Capacitance Devices

Methods: Multi-Head Attention · Attention Is All You Need · Adam · Softmax · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Residual Connection