TL;DR
Intent2Tx introduces a comprehensive benchmark for evaluating how well large language models translate natural language intents into Ethereum blockchain transactions, emphasizing real-world complexity and execution accuracy.
Contribution
It provides a large, real-world, execution-aware benchmark and evaluation framework for assessing LLMs' ability to generate correct on-chain transactions from natural language.
Findings
Scaling and retrieval augmentation improve logical consistency.
Models struggle with out-of-distribution generalization.
Syntactically valid outputs often fail to produce intended state changes.
Abstract
The emergence of Large Language Models (LLMs) offers a transformative interface for Web3, yet existing benchmarks fail to capture the complexity of translating high-level user intents into functionally correct, state-dependent on-chain transactions. We present Intent2Tx, a high-fidelity benchmark featuring 29,921 single-step and 1,575 multi-step instances meticulously derived from 300 days of real-world Ethereum mainnet traces. Unlike prior works that rely on synthetic instructions, Intent2Tx grounds natural language intents in real-world protocol interactions across 11 categories, including diverse long-tail Decentralized Finance (DeFi) primitives. To enable rigorous evaluation, we propose an execution-aware framework that transcends surface-level text matching by employing differential state analysis on forked mainnet environments. Our extensive evaluation of 16…
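The idea of differential state analysis can be illustrated with a toy sketch. This is not the authors' framework: the state model (a dictionary from `(account, asset)` to a balance), the helper names `state_diff` and `matches_intent`, and the example accounts are all assumptions made for illustration. A real evaluation would execute transactions on a forked mainnet node and would also have to deal with gas, nonces, and contract storage.

```python
from typing import Dict, Tuple

# Hypothetical simplified state: (account, asset) -> balance.
StateKey = Tuple[str, str]
State = Dict[StateKey, int]


def state_diff(pre: State, post: State) -> Dict[StateKey, int]:
    """Return the nonzero balance change for every key that differs."""
    keys = set(pre) | set(post)
    return {
        k: post.get(k, 0) - pre.get(k, 0)
        for k in keys
        if post.get(k, 0) != pre.get(k, 0)
    }


def matches_intent(pre: State, post_reference: State, post_candidate: State) -> bool:
    """Compare the candidate's state changes with the reference's changes."""
    return state_diff(pre, post_reference) == state_diff(pre, post_candidate)


# State before execution (illustrative values only).
pre = {("alice", "USDC"): 100, ("bob", "USDC"): 0, ("carol", "USDC"): 0}

# Reference transaction: alice sends 10 USDC to bob.
post_reference = {("alice", "USDC"): 90, ("bob", "USDC"): 10, ("carol", "USDC"): 0}

# Candidate that is well-formed but sends the funds to carol instead.
post_wrong = {("alice", "USDC"): 90, ("bob", "USDC"): 0, ("carol", "USDC"): 10}

print(matches_intent(pre, post_reference, post_reference))
print(matches_intent(pre, post_reference, post_wrong))
```

The second candidate would pass any check that only inspects the transaction's form, yet its state diff differs from the reference. That is the failure mode the findings describe: syntactically valid outputs that do not produce the intended state changes.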
