EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Issa Sugiura; Takashi Ishida; Taro Makino; Chieko Tazuke; Takanori Nakagawa; Kosuke Nakago; David Ha

arXiv:2506.08762·q-fin.ST·March 6, 2026

TL;DR

EDINET-Bench is a new open-source Japanese financial benchmark that evaluates LLMs on complex financial tasks requiring expert reasoning, revealing current models' limitations and emphasizing the need for more sophisticated evaluation frameworks.

Contribution

Introduces EDINET-Bench, a comprehensive Japanese financial benchmark for LLMs, highlighting the challenges and gaps in current models' performance on complex financial tasks.

Findings

01. LLMs perform only marginally better than logistic regression on financial tasks.

02. Current LLMs struggle with processing entire financial reports and integrating information.

03. Simple report provision is insufficient; richer scaffolding is needed for effective financial reasoning.
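
To make the first finding concrete, here is a minimal sketch of the kind of logistic regression baseline an LLM could be compared against. It is not the paper's implementation: the two features, the synthetic data, the training settings, and the choice of ROC-AUC as the metric are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def roc_auc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_row(label):
    # Hypothetical standardized financial ratios; positives are shifted upward.
    shift = 1.0 if label else -1.0
    return [random.gauss(shift, 1.0), random.gauss(0.5 * shift, 1.0)]


# Synthetic stand-in data, NOT drawn from EDINET-Bench.
random.seed(0)
y_train = [i % 2 for i in range(200)]
X_train = [make_row(t) for t in y_train]
y_test = [i % 2 for i in range(100)]
X_test = [make_row(t) for t in y_test]

w, b = fit_logistic_regression(X_train, y_train)
auc = roc_auc(y_test, predict_proba(X_test, w, b))
print(f"baseline ROC-AUC on synthetic data: {auc:.3f}")
```

Scoring an LLM's predicted probabilities with the same `roc_auc` function on the same test split would give a like-for-like comparison with this baseline.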

Abstract

Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. In contrast, the financial domain poses higher entry barriers due to its demand for specialized expertise, and benchmarks remain relatively scarce compared to those in mathematics or coding. We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. These tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This would be a large-scale Japanese financial benchmark requiring expert reasoning, so the work looks well delivered.
2. Covers diverse, realistic financial tasks. Tasks based on real-world financial scenarios are valuable.
3. Includes a reproducible toolkit and dataset, which have other applications as well.

Weaknesses

1. LLMs show limited performance gains over traditional models. SLMs could also be explored in depth for these tasks. An MCP route with more sophisticated approaches could also be considered.
2. Some of the evaluation is limited to zero-shot, with no fine-tuning comparisons. Zero-shot evaluation can be a good approach in some cases but cannot be relied on all the time.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Clear gap & relevance. Financial NLP benchmarks often emphasize QA/extraction (e.g., FinQA/ConvFinQA/FinanceBench; largely English), whereas this work targets end-to-end expert reasoning over full reports in Japanese, a meaningful and under-served setting.

Nontrivial tasks and inputs. Whole-report inputs (tables + text) and tasks like fraud detection are practically impactful and distinct from prior QA settings and multimodal finance QA (e.g., FAMMA).

Open tooling & dataset construction detai

Weaknesses

Fraud labels rely on LLM-assisted screening of amended reports. Although later manually checked, the pipeline first classifies “amended report reasons” with Claude and then claims <5% label errors. This creates potential circularity (LLM both builds and is evaluated on the dataset domain) and possible systematic biases in what counts as fraud (e.g., wording styles of amendments). A larger human validation and inter-rater agreement would strengthen validity.

Definition of “fraud” is ambiguous. A
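
The inter-rater agreement this reviewer asks for is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is only an illustration: the two annotators and their labels are invented, and neither the paper nor the review specifies this statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labelled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two human annotators (made up for illustration).
annotator_1 = ["fraud", "fraud", "not_fraud", "not_fraud",
               "fraud", "not_fraud", "not_fraud", "not_fraud"]
annotator_2 = ["fraud", "not_fraud", "not_fraud", "not_fraud",
               "fraud", "not_fraud", "fraud", "not_fraud"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.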

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- Real-world, long-context financial documents (tables + text) with open data, code, and tooling for reproducibility.
- Clear task definitions and splits; systematic evaluation across models and input modalities, plus contamination checks.
- Practical insights (e.g., text helps fraud detection but not earnings forecasting) that inform future benchmark and agentic framework design.

Weaknesses

- The benchmark is limited to the Japanese financial market. But I believe the methodology of the benchmark is applicable to other financial markets. Results may be different and interesting in other markets (I guess the industry prediction task results may vary).
- The industry prediction task, while straightforward to evaluate, does not closely reflect real-world use cases. In practice, industry labels are readily available from official sources, so predicting them from financial statements h

Code & Models

Repositories

Datasets

SakanaAI/EDINET-Bench
dataset· 637 dl

Taxonomy

Topics: Financial Distress and Bankruptcy Prediction · Stock Market Forecasting Methods · Auditing, Earnings Management, Governance

Methods: Logistic Regression