CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis

Anushri Eswaran, Oleg Golev, Darshan Tank, Sidhant Rahi, and Himanshu Tyagi

arXiv:2602.11304 · cs.IR · March 26, 2026

Open Access

TL;DR

This paper introduces CryptoAnalystBench, a benchmark for analyzing failures in large language models when reasoning over complex crypto-related data, revealing persistent errors that impact high-stakes decision-making.

Contribution

It presents a new benchmark, an evaluation pipeline, and a taxonomy of failure modes for LLMs handling multi-tool, high-density data in the crypto domain.

Findings

01. Failures that affect decision quality persist in state-of-the-art systems

02. A taxonomy of seven higher-order error types was developed

03. The evaluation rubric reliably detects critical failure modes

Abstract

Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks and examined factuality in knowledge-augmented systems, relatively little work studies their intersection: settings where LLMs must integrate large volumes of dynamic, structured and unstructured multi-tool outputs. We investigate LLM failure modes in this regime using crypto as a representative high-data-density domain. We introduce (1) CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries spanning 11 categories; (2) an agentic harness equipped with relevant crypto and DeFi tools to generate responses across multiple frontier LLMs; and (3) an evaluation pipeline with citation verification and an LLM-as-a-judge rubric spanning four user-defined success…

Taxonomy

Topics: Advanced Malware Detection Techniques · Intelligence, Security, War Strategy · Digital and Cyber Forensics