TL;DR
s2n-bignum-bench is a new benchmark designed to evaluate LLMs' ability to generate formal proofs for industrial cryptographic assembly routines verified in HOL Light, bridging the gap between theorem proving and real-world code verification.
Contribution
It introduces the first public benchmark for machine-checkable proof synthesis of cryptographic assembly routines, emphasizing practical verification challenges.
Findings
Provides formal specifications for cryptographic routines
Challenges LLMs to generate accepted proof scripts in HOL Light
Establishes a new standard for evaluating LLMs in industrial formal verification
Abstract
Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
