Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk
Zichen Chen, Jiaao Chen, Jianda Chen, Misha Sra

TL;DR
This paper argues that evaluating financial LLM agents should prioritize risk assessment over traditional accuracy metrics, highlighting vulnerabilities and proposing stress-testing frameworks for safer deployment.
Contribution
It introduces a risk-focused evaluation framework for financial LLM agents, emphasizing stress-testing and safety metrics over conventional performance scores.
Findings
Uncovered hidden weaknesses in existing LLM agents during stress tests.
Conventional benchmarks overlook critical safety vulnerabilities.
Recommended risk-aware metrics and stress scenarios for future evaluations.
Abstract
Standard benchmarks fixate on how well large language model (LLM) agents perform in finance, yet say little about whether they are safe to deploy. We argue that accuracy metrics and return-based scores provide an illusion of reliability, overlooking vulnerabilities such as hallucinated facts, stale data, and adversarial prompt manipulation. We take a firm position: financial LLM agents should be evaluated first and foremost on their risk profile, not on their point-estimate performance. Drawing on risk-engineering principles, we outline a three-level agenda: model, workflow, and system, for stress-testing LLM agents under realistic failure modes. To illustrate why this shift is urgent, we audit six API-based and open-weights LLM agents on three high-impact tasks and uncover hidden weaknesses that conventional benchmarks miss. We conclude with actionable recommendations for researchers,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCorporate Insolvency and Governance · European and International Contract Law
MethodsFocus
