VeruSAGE: A Study of Agent-Based Verification for Rust Systems
Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, Shan Lu

TL;DR
This study evaluates the ability of various large language models to develop correctness proofs for Rust system software, introducing a new benchmark suite and agent systems to enhance verification performance.
Contribution
It presents VeruSAGE-Bench, a new benchmark for Rust verification, and designs specialized agent systems to improve LLMs' proof development capabilities.
Findings
The best LLM-agent combo completes over 80% of verification tasks.
Over 90% of additional proof tasks are completed by the best system, surpassing human performance.
Different LLMs require tailored agent configurations for optimal verification results.
Abstract
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
