kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
Alex Mathai, Chenxi Huang, Petros Maniatis, Aleksandr Nogikh, Franjo Ivancic, Junfeng Yang, Baishakhi Ray

TL;DR
This paper introduces kGym and kBench, a platform and dataset for benchmarking large language models on Linux kernel crash resolution, highlighting current performance gaps and future research directions.
Contribution
The paper presents a novel platform and dataset for evaluating LLMs on Linux kernel crash resolution, enabling large-scale experiments and benchmarking in systems software.
Findings
Best LLM achieves 0.72% accuracy in the unassisted setting
Best LLM achieves 5.38% accuracy in the assisted setting (buggy files disclosed to the model)
Current models perform poorly on complex systems software tasks
Abstract
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide); and highly concurrent (involving complex multi-threading). To evaluate whether ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides an SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching…
Taxonomy
Topics: Software System Performance and Reliability · Advanced Data Processing Techniques
