s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Balaji Rao; John Harrison; Soonho Kong; Juneyoung Lee; Carlo Lipizzi

arXiv:2603.14628·cs.PL·April 21, 2026

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Balaji Rao, John Harrison, Soonho Kong, Juneyoung Lee, Carlo Lipizzi

PDF

1 Repo

TL;DR

s2n-bignum-bench is a new benchmark designed to evaluate LLMs' ability to generate formal proofs for industrial cryptographic assembly routines verified in HOL Light, bridging the gap between theorem proving and real-world code verification.

Contribution

It introduces the first public benchmark for machine-checkable proof synthesis of cryptographic assembly routines, emphasizing practical verification challenges.

Findings

01

Provides formal specifications for cryptographic routines

02

Challenges LLMs to generate accepted proof scripts in HOL Light

03

Establishes a new standard for evaluating LLMs in industrial formal verification

Abstract

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

kings-crown/s2n-bignum-bench
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.