Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

Chenxi Huang; Alex Mathai; Feiyang Yu; Aleksandr Nogikh; Petros Maniatis; Franjo Ivančić; Eugene Wu; Kostis Kaffes; Junfeng Yang; Baishakhi Ray

arXiv:2602.02690 · cs.SE · February 16, 2026

Open Access

TL;DR

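This paper introduces Live-kBench, an evaluation framework for self-evolving benchmarks, and kEnv, a standardized crash-resolution environment, for evaluating LLM-based Linux kernel crash-resolution agents on freshly discovered bugs. It reports a gap in patch rates around LLM knowledge cutoffs and a gain in resolution rate when agents are given crash feedback.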

Contribution

It presents a self-evolving benchmark framework and an agent-agnostic, standardized environment for fair, scalable evaluation of crash-resolution agents on freshly discovered Linux kernel bugs.

Findings

01

Agents achieve up to a 25% higher patch rate on bugs from before the LLM knowledge cutoff.

02

State-of-the-art agents resolve 74% of crashes on first attempt.

03

Exposing crash feedback improves resolution rate by 29%.

Abstract

Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, their evaluation benchmarks are usually static and thus do not capture the evolving nature of the Linux kernel, and they suffer from potential data contamination due to LLM knowledge cutoffs. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes freshly discovered kernel bugs and evaluates agents on them, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavyweight execution, enabling fair and scalable comparison across diverse agent frameworks under…
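The contamination concern in the abstract can be illustrated with a small sketch that partitions benchmark instances by a model's knowledge cutoff and compares resolution rates on each side. This is not the Live-kBench implementation: the `BugResult` record, its fields, the helper names, the sample data, and the cutoff date are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BugResult:
    """Hypothetical record: one benchmark instance and one agent attempt."""
    bug_id: str
    fix_date: date   # assumed field: date the upstream fix landed
    resolved: bool   # assumed field: True if the agent's patch stopped the crash


def resolution_rate(results):
    """Return the fraction of resolved instances, or None for an empty list."""
    if not results:
        return None
    return sum(r.resolved for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Partition instances into (fixed on or before cutoff, fixed after cutoff)."""
    before = [r for r in results if r.fix_date <= cutoff]
    after = [r for r in results if r.fix_date > cutoff]
    return before, after


# Made-up sample data; not results from the paper.
results = [
    BugResult("bug-a", date(2024, 3, 1), True),
    BugResult("bug-b", date(2024, 9, 15), True),
    BugResult("bug-c", date(2025, 4, 2), False),
    BugResult("bug-d", date(2025, 8, 20), True),
]
cutoff = date(2025, 1, 1)  # hypothetical knowledge cutoff

before, after = split_by_cutoff(results, cutoff)
print("before cutoff:", resolution_rate(before))
print("after cutoff:", resolution_rate(after))
```

A large gap between the two rates would be consistent with contamination, since a model may have seen pre-cutoff fixes during training. A continuously refreshed benchmark keeps adding instances to the post-cutoff side.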

Taxonomy

Topics: Security and Verification in Computing · Software Testing and Debugging Techniques · Advanced Data Storage Technologies