KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution

Alex Mathai; Chenxi Huang; Petros Maniatis; Aleksandr Nogikh; Franjo Ivancic; Junfeng Yang; Baishakhi Ray

arXiv:2407.02680·cs.SE·November 13, 2024

TL;DR

This paper introduces kGym and kBench, a platform and dataset for benchmarking large language models on Linux kernel crash resolution, highlighting current performance gaps and future research directions.

Contribution

The paper presents a novel platform and dataset for evaluating LLMs on Linux kernel crash resolution, enabling large-scale experiments and benchmarking in systems software.

Findings

01

Best LLM achieves 0.72% accuracy unassisted

02

Best LLM achieves 5.38% accuracy with assistance

03

Current models perform poorly on complex systems software tasks

Abstract

Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide), and highly concurrent (involving complex multi-threading). To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides a SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching…

Code & Models

Repositories

Alex-Mathai-98/kGym-Kernel-Playground

Videos

kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution · SlidesLive

Taxonomy

Topics: Software System Performance and Reliability · Advanced Data Processing Techniques