DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

Linghao Zhang; Junhao Wang; Shilin He; Chaoyun Zhang; Yu Kang; Bowen Li; Jiaheng Wen; Chengxing Xie; Maoquan Wang; Yufan Huang; Elsie Nallipogu; Qingwei Lin; Yingnong Dang; Saravan Rajmohan; Dongmei Zhang; Qi Zhang

arXiv:2501.13699 · cs.CL · January 24, 2025

Open Access

TL;DR

DI-BENCH is a large-scale benchmark designed to evaluate how well Large Language Models can infer software dependencies across multiple programming languages, revealing significant gaps in current model capabilities.

Contribution

Introduces DI-BENCH, a comprehensive benchmark with 581 repositories for assessing LLMs' dependency inference, highlighting the need for improved models in software synthesis.

Findings

1. The best-performing model achieves only a 42.9% execution pass rate.
2. Dependency issues cause over 40% of runtime errors in generated repositories.
3. DI-BENCH provides a new evaluation perspective for LLM-based software development.

Abstract

Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, to identify the internal components and external packages required for a repository to run successfully. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability in dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM…
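
The textual side of the evaluation mentioned in the abstract can be pictured as comparing the set of dependencies a model infers against the set a repository actually declares. The sketch below is an illustrative assumption, not DI-BENCH's own implementation: the function name `dependency_scores`, the case-insensitive name normalization, and the choice of precision, recall, and F1 as the scores are all hypothetical and are not taken from the paper.

```python
def dependency_scores(predicted, expected):
    """Return (precision, recall, f1) for two collections of dependency names.

    Hypothetical helper: names are compared as case-insensitive sets, which is
    an assumption made for illustration and ignores version constraints.
    """
    pred = {name.strip().lower() for name in predicted}
    gold = {name.strip().lower() for name in expected}

    # Dependencies that appear in both the inferred and the declared set.
    hits = len(pred & gold)

    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Example: the model infers three packages; the repository declares four.
inferred = ["requests", "numpy", "flask"]
declared = ["requests", "numpy", "pandas", "pytest"]
precision, recall, f1 = dependency_scores(inferred, declared)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A textual score of this kind only checks name overlap. An execution-based metric, such as the pass rate reported in the abstract, instead asks whether the repository's tests actually run with the inferred dependencies, so the two can disagree.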

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management