LATTICE: Evaluating Decision Support Utility of Crypto Agents
Aaron Chan, Tengfei Li, Tianyi Xiao, Angela Chen, Junyi Du, Xiang Ren

TL;DR
LATTICE is a scalable benchmark using LLM judges to evaluate the decision support utility of crypto agents across multiple dimensions and tasks, reflecting real-world crypto copilot scenarios.
Contribution
The paper introduces a scalable evaluation framework for crypto agents that uses LLM judges to assess decision support quality, without relying on ground truth from expert annotators or external data sources.
Findings
Most crypto copilots have similar overall scores but differ significantly on specific dimensions.
The evaluation reveals trade-offs in decision support quality across different copilots.
LATTICE enables continuous, extensible assessment through open-source code and data.
Abstract
We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent benchmarks mainly focus on reasoning-based or outcome-based evaluation, but do not assess agents' ability to assist user decision-making. LATTICE addresses this gap by: (1) defining six evaluation dimensions that capture key decision support properties; (2) proposing 16 task types that span the end-to-end crypto copilot workflow; and (3) using LLM judges to automatically score agent outputs based on these dimensions and tasks. Crucially, the dimensions and tasks are designed to be evaluable at scale using LLM judges, without relying on ground truth from expert annotators or external data sources. In lieu of these dependencies, LATTICE's LLM judge rubrics can be continually audited and updated given new dimensions, tasks, criteria, and…
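The scoring scheme described in the abstract (an LLM judge rating each agent output per task and per dimension) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The dimension and task names are hypothetical placeholders, since the summary does not list them. The `judge` callable stands in for an LLM judge call. The unweighted mean used for aggregation is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical placeholders: the summary gives only the counts
# (six dimensions, 16 task types), not their names.
DIMENSIONS = [f"dimension_{i}" for i in range(1, 7)]
TASKS = [f"task_{i}" for i in range(1, 17)]


@dataclass(frozen=True)
class JudgeScore:
    agent: str
    task: str
    dimension: str
    score: float


# Assumed judge interface: (agent_output, task, dimension) -> numeric score.
# In practice this would wrap an LLM call with a rubric; here it is a stub.
Judge = Callable[[str, str, str], float]


def score_agent(agent: str, outputs: Dict[str, str], judge: Judge) -> List[JudgeScore]:
    """Score one agent's output for every (task, dimension) pair."""
    return [
        JudgeScore(agent, task, dim, judge(outputs[task], task, dim))
        for task in TASKS
        for dim in DIMENSIONS
    ]


def per_dimension_means(scores: List[JudgeScore]) -> Dict[str, float]:
    """Average scores within each dimension (unweighted; an assumption)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for s in scores:
        buckets[s.dimension].append(s.score)
    return {dim: mean(vals) for dim, vals in buckets.items()}


def overall_mean(scores: List[JudgeScore]) -> float:
    """Single overall score: mean over all (task, dimension) pairs."""
    return mean(s.score for s in scores)
```

Keeping the per-dimension breakdown alongside the overall mean matters for the finding above: two agents can have nearly identical overall means while their per-dimension profiles differ.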
