SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications
Abhinav Goel, Agostino Capponi, Alfio Gliozzo, Chaitya Shah

TL;DR
SmartEval is a benchmark for assessing the quality of Solidity smart contracts that LLMs generate from natural language specifications. It pairs 9,000 generated contracts with expert-written ground-truth implementations, scores them on a five-dimensional rubric, and is validated through three independent empirical studies.
Contribution
It introduces a benchmark comprising a corpus of 9,000 generated contracts, a five-dimensional evaluation rubric, and a reproducible generation-and-evaluation pipeline for systematic assessment of LLM-generated smart contracts.
Findings
Automated scores align with expert judgment to within 0.34 points.
The LLM auditor and the static analyzer reach 79.4% agreement.
Generated contracts outperform the ground-truth implementations by +8.29 in composite score.
Abstract
We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and…
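As a rough illustration of how a five-dimensional rubric like the one described above could be aggregated and compared against expert ratings, here is a minimal Python sketch. The abstract does not state the score scale, the dimension weights, or how the 0.34-point alignment is computed, so the equal-weight mean, the mean-absolute-gap measure, and all names (`RubricScores`, `composite`, `mean_abs_gap`) are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScores:
    """One contract's scores on the five rubric dimensions named in the abstract.

    The numeric scale is not given in the source; any consistent scale works here.
    """
    functional_completeness: float
    variable_fidelity: float
    state_machine_correctness: float
    business_logic_fidelity: float
    code_quality: float


def composite(scores: RubricScores) -> float:
    """Unweighted mean of the five dimensions (ASSUMED aggregation rule)."""
    values = [getattr(scores, f.name) for f in fields(scores)]
    return sum(values) / len(values)


def mean_abs_gap(automated: list[float], expert: list[float]) -> float:
    """Mean absolute difference between paired automated and expert scores.

    ASSUMPTION: this is one plausible way to quantify "alignment within
    N points"; the paper may define it differently.
    """
    if not automated or len(automated) != len(expert):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(a - e) for a, e in zip(automated, expert)) / len(automated)
```

With hypothetical inputs, `composite(RubricScores(80, 90, 70, 60, 100))` averages the five dimension scores, and `mean_abs_gap` summarizes how far automated scores sit from expert scores on the same contracts.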
