Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought
Zichen Xie, Wenxi Wang

TL;DR
This paper introduces VCoT-Bench, a benchmark for evaluating LLMs' reasoning in Rust verification through explicit verification chains, revealing current models' fragility and limited understanding.
Contribution
The paper presents VCoT-Lift and VCoT-Bench, enabling fine-grained, chain-of-thought evaluation of LLMs' Rust verification reasoning capabilities.
Findings
Current LLMs show significant fragility in Rust verification tasks.
Evaluation reveals models' limited understanding of logical deduction in verification.
Benchmark provides detailed insights into models' strengths and weaknesses.
Abstract
As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLogic, programming, and type systems · Security and Verification in Computing · Web Application Security Vulnerabilities
