TL;DR
EVM-QuestBench is a new execution-grounded benchmark for evaluating natural-language transaction code generation on EVM-compatible chains, emphasizing execution accuracy and safety.
Contribution
It introduces a modular benchmark with dynamic evaluation, comprising 107 tasks (62 atomic, 45 composite) and a runner that executes scripts on a forked EVM chain with snapshot isolation.
Findings
Across the 20 models evaluated, performance gaps are large.
Split scores reveal an asymmetry between single-action and multi-step tasks.
The modular architecture enables rapid task development.
Abstract
Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with…
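The dynamic evaluation described in the abstract (sample an instruction template, draw a numeric parameter from a predefined interval, then validate the outcome against the instantiated value) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the benchmark's code: `TaskTemplate`, `TaskInstance`, `instantiate`, `validate_transfer`, the template wording, and the use of integer wei amounts are assumptions chosen only for illustration. The sketch also replaces real on-chain state with plain integer balances.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTemplate:
    """Hypothetical task definition: a pool of phrasings plus a parameter interval."""
    templates: tuple[str, ...]
    amount_range: tuple[int, int]  # inclusive bounds, in wei (assumed unit)


@dataclass(frozen=True)
class TaskInstance:
    """One concrete instruction together with the value the validator must check."""
    instruction: str
    amount: int


def instantiate(task: TaskTemplate, rng: random.Random) -> TaskInstance:
    # Sample a phrasing from the template pool.
    template = rng.choice(task.templates)
    # Draw the numeric parameter from the predefined interval.
    low, high = task.amount_range
    amount = rng.randint(low, high)
    return TaskInstance(instruction=template.format(amount=amount), amount=amount)


def validate_transfer(instance: TaskInstance, before: int, after: int) -> bool:
    # The check is tied to the instantiated value, not to a fixed constant,
    # so a memorized answer for one instance fails on another.
    return after - before == instance.amount


task = TaskTemplate(
    templates=(
        "Send {amount} wei to the recipient.",
        "Transfer exactly {amount} wei to the recipient address.",
    ),
    amount_range=(1_000, 1_000_000),
)

rng = random.Random(0)  # seeded so a run can be reproduced
instance = instantiate(task, rng)
print(instance.instruction)
```

In a real runner the `before` and `after` balances would be read from the forked chain around the script's execution; here they are passed in directly to keep the sketch self-contained.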
