A Benchmark for Language Models in Real-World System Building
Weilin Jin, Chenyu Zhao, Zeshun Huang, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Shenglin Zhang, Yongqian Sun, Dan Pei, Yifan Wu, Tong Jia, Ying Li, Zhonghai Wu, Minghua Ma

TL;DR
This paper introduces a new benchmark for evaluating large language models on their ability to repair software package builds across diverse architectures and languages, highlighting ongoing challenges in cross-ISA software repair.
Contribution
The paper presents a benchmark of 268 real-world software package build failures for cross-architecture build repair, enabling standardized evaluation of LLMs on this task.
Findings
Cross-ISA repair remains challenging for current LLMs
Benchmark includes 268 real-world failure cases
Evaluation of six state-of-the-art LLMs shows limited success
Abstract
During migration across instruction set architectures (ISAs), software package build repair is a critical task for ensuring the reliability of software deployment and the stability of modern operating systems. While Large Language Models (LLMs) have shown promise in tackling this challenge, prior work has primarily focused on a single ISA and homogeneous programming languages. To address this limitation, we introduce a new benchmark designed for software package build repair across diverse architectures and languages. Comprising 268 real-world software package build failures, the benchmark provides a standardized evaluation pipeline. We evaluate six state-of-the-art LLMs on the benchmark, and the results show that cross-ISA software package repair remains difficult and requires further advances. By systematically exposing this challenge, the benchmark…
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Radiation Effects in Electronics
