Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All
Chenxi Huang, Alex Mathai, Feiyang Yu, Aleksandr Nogikh, Petros Maniatis, Franjo Ivančić, Eugene Wu, Kostis Kaffes, Junfeng Yang, Baishakhi Ray

TL;DR
This paper introduces Live-kBench, a dynamic evaluation framework and kEnv environment for benchmarking LLM-based Linux kernel crash-resolution agents on evolving bugs, highlighting performance gaps and improvements in crash fix rates.
Contribution
It presents a self-evolving benchmark framework and a standardized environment for fair, scalable evaluation of crash-resolution agents on freshly discovered Linux kernel bugs.
Findings
Agents achieve up to a 25% higher patch rate on bugs that predate the LLM knowledge cutoff.
State-of-the-art agents resolve 74% of crashes on the first attempt.
Exposing crash feedback improves resolution rate by 29%.
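The cutoff comparison in the findings can be illustrated with a minimal sketch. It assumes each evaluated bug carries a report date and a resolved/unresolved outcome. The `BugResult` record, the helper names, the cutoff date, and the sample data are all hypothetical and are not taken from the paper or from Live-kBench.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BugResult:
    bug_id: str      # hypothetical identifier, not a real Syzkaller bug ID
    reported: date   # date the crash was reported
    resolved: bool   # whether the agent's patch resolved the crash


def resolution_rate(results):
    """Return the fraction of results marked resolved, or None if empty."""
    if not results:
        return None
    return sum(r.resolved for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Partition results into (reported on or before cutoff, reported after)."""
    before = [r for r in results if r.reported <= cutoff]
    after = [r for r in results if r.reported > cutoff]
    return before, after


# Made-up illustrative data; these are not results from the paper.
results = [
    BugResult("bug-a", date(2024, 1, 10), True),
    BugResult("bug-b", date(2024, 3, 5), True),
    BugResult("bug-c", date(2024, 9, 1), False),
    BugResult("bug-d", date(2024, 11, 20), True),
]
cutoff = date(2024, 6, 1)  # hypothetical knowledge cutoff

before, after = split_by_cutoff(results, cutoff)
pre_rate = resolution_rate(before)
post_rate = resolution_rate(after)
print(f"pre-cutoff: {pre_rate:.2f}, post-cutoff: {post_rate:.2f}")
```

A gap between the two rates is what a static benchmark cannot expose: once every bug predates the model's cutoff, there is no post-cutoff group left to compare against.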
Abstract
Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, their evaluation benchmarks are usually static. As a result, they do not capture the evolving nature of the Linux kernel and suffer from potential data contamination due to LLM knowledge cutoffs. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavyweight execution, enabling fair and scalable comparison across diverse agent frameworks under…
Taxonomy
Topics: Security and Verification in Computing · Software Testing and Debugging Techniques · Advanced Data Storage Technologies
