CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis
Anushri Eswaran, Oleg Golev, Darshan Tank, Sidhant Rahi, and Himanshu Tyagi

TL;DR
This paper introduces CryptoAnalystBench, a benchmark for analyzing how large language models fail when they reason over complex crypto-related data. It reveals persistent errors that affect high-stakes decision-making.
Contribution
It presents a new benchmark, an evaluation pipeline, and a taxonomy of failure modes for LLMs handling multi-tool, high-density data in the crypto domain.
Findings
Failures persist in state-of-the-art systems and affect decision quality.
A taxonomy of seven higher-order error types was developed.
The evaluation rubric reliably detects critical failure modes.
Abstract
Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks and examined factuality in knowledge-augmented systems, relatively little work studies their intersection: settings where LLMs must integrate large volumes of dynamic, structured and unstructured multi-tool outputs. We investigate LLM failure modes in this regime using crypto as a representative high-data-density domain. We introduce (1) CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries spanning 11 categories; (2) an agentic harness equipped with relevant crypto and DeFi tools to generate responses across multiple frontier LLMs; and (3) an evaluation pipeline with citation verification and an LLM-as-a-judge rubric spanning four user-defined success…
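The citation verification step mentioned in the abstract could look roughly like the sketch below. It is illustrative only and rests on assumptions not stated in the source: that responses cite sources as inline URLs, and that verification means checking each cited URL against the set of URLs actually returned by the agent's tools. The names `CitationReport`, `verify_citations`, and `tool_source_urls` are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class CitationReport:
    cited: list[str]        # URLs the response cites, in order of first appearance
    unsupported: list[str]  # cited URLs that no tool output returned

    @property
    def support_rate(self) -> float:
        # A response with no citations is treated as vacuously supported.
        if not self.cited:
            return 1.0
        return 1.0 - len(self.unsupported) / len(self.cited)


# Assumption: citations appear as plain http(s) URLs in the response text.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def verify_citations(response: str, tool_source_urls: set[str]) -> CitationReport:
    """Hypothetical check: flag cited URLs that are absent from the tool outputs."""
    cited: list[str] = []
    for url in URL_PATTERN.findall(response):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in cited:
            cited.append(url)
    unsupported = [u for u in cited if u not in tool_source_urls]
    return CitationReport(cited=cited, unsupported=unsupported)
```

A check like this only confirms that a cited source was actually retrieved. It does not confirm that the source supports the claim attached to it, which would need a separate step.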
Taxonomy
Topics: Advanced Malware Detection Techniques · Intelligence, Security, War Strategy · Digital and Cyber Forensics
