FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering

Chanyeol Choi; Jihoon Kwon; Alejandro Lopez-Lira; Chaewoon Kim; Minjae Kim; Juneha Hwang; Jaeseon Ha; Hojun Choi; Suyeol Yun; Yongjin Kim; and Yongjae Lee

arXiv:2508.14052 · cs.IR · October 6, 2025

Open Access

TL;DR

FinAgentBench is a large-scale benchmark dataset designed to evaluate multi-step, agentic retrieval capabilities of language models in the financial domain, addressing a critical gap in domain-specific information retrieval evaluation.

Contribution

The paper introduces FinAgentBench, the first benchmark for agentic retrieval in finance. The dataset contains 26K annotated examples and is used to evaluate models' ability to identify relevant documents and key passages.

Findings

01. State-of-the-art models show room for improvement in agentic retrieval.

02. Targeted fine-tuning significantly enhances retrieval performance.

03. The benchmark enables detailed analysis of LLM behavior in financial retrieval tasks.

Abstract

Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods -- whether sparse or dense -- often fall short in retrieval accuracy, because financial retrieval requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, no benchmark exists to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance -- a setting we term…

Taxonomy

Topics: Stock Market Forecasting Methods · Topic Modeling · Information Retrieval and Search Behavior