EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Pei Yang; Wanyi Chen; Ke Wang; Lynn Ai; Eric Yang; Tianyu Shi

arXiv:2601.06565 · cs.CL · April 8, 2026

TL;DR

EVM-QuestBench is a new execution-grounded benchmark for evaluating natural-language transaction code generation on EVM-compatible chains, emphasizing execution accuracy and safety.

Contribution

The paper introduces a modular, dynamically evaluated benchmark of 107 tasks (62 atomic, 45 composite) and a runner that executes generated scripts on a forked EVM chain, exposing performance gaps among current models.

Findings

01. The 20 evaluated models show large performance gaps.

02. Split scores reveal an asymmetry between single-action and multi-step tasks.

03. The benchmark enables rapid development and evaluation of transaction scripts.

Abstract

Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with…
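The dynamic evaluation described in the abstract (sample an instruction from a template pool, draw a numeric parameter from a predefined interval, validate the outcome against the instantiated value) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `TaskInstance`, `TEMPLATES`, `AMOUNT_RANGE`, `instantiate`, `validate`, and `decayed_score` are hypothetical and are not taken from the benchmark's repository, balances are plain floats rather than on-chain state, and the geometric form of the step-efficiency decay is a stand-in, since the abstract does not give the actual formula.

```python
import random
from dataclasses import dataclass

# All names and values here are hypothetical; none come from the benchmark's code.


@dataclass(frozen=True)
class TaskInstance:
    instruction: str  # natural-language prompt given to the model
    amount: float     # numeric parameter drawn for this instance


# Hypothetical template pool and parameter interval for one transfer task.
TEMPLATES = [
    "Send {amount} tokens to the recipient.",
    "Transfer {amount} tokens to the recipient address.",
]
AMOUNT_RANGE = (0.01, 1.0)


def instantiate(rng: random.Random) -> TaskInstance:
    """Sample one template and one amount, then bind them together."""
    template = rng.choice(TEMPLATES)
    amount = round(rng.uniform(*AMOUNT_RANGE), 4)
    return TaskInstance(template.format(amount=amount), amount)


def validate(task: TaskInstance, before: float, after: float,
             tol: float = 1e-9) -> bool:
    """Pass only if the observed balance change matches the drawn amount."""
    return abs((after - before) - task.amount) <= tol


def decayed_score(base: float, steps_used: int, optimal_steps: int,
                  decay: float = 0.9) -> float:
    """Assumed geometric decay: each step beyond the optimum scales the score."""
    extra_steps = max(0, steps_used - optimal_steps)
    return base * decay ** extra_steps


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is reproducible
    task = instantiate(rng)
    print(task.instruction)
    print(validate(task, before=5.0, after=5.0 + task.amount))
    print(decayed_score(100.0, steps_used=5, optimal_steps=3))
```

Because the validator checks against the value drawn at instantiation time rather than a fixed constant, a script that hard-codes an amount seen in an earlier run would fail on a fresh sample, which is the point of evaluating dynamically.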

Code & Models

Repositories

https://anonymous.4open.science/r/bsc_quest_bench-A9CF
