EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements
Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha

TL;DR
EDINET-Bench is a new open-source Japanese financial benchmark that evaluates LLMs on complex financial tasks requiring expert reasoning, revealing current models' limitations and emphasizing the need for more sophisticated evaluation frameworks.
Contribution
Introduces EDINET-Bench, a comprehensive Japanese financial benchmark for LLMs, highlighting the challenges and gaps in current models' performance on complex financial tasks.
Findings
LLMs perform only marginally better than logistic regression on financial tasks.
Current LLMs struggle with processing entire financial reports and integrating information.
Simple report provision is insufficient; richer scaffolding is needed for effective financial reasoning.
Abstract
Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. In contrast, the financial domain poses higher entry barriers due to its demand for specialized expertise, and benchmarks remain relatively scarce compared to those in mathematics or coding. We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. These tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is…
Peer Reviews
Decision·ICLR 2026 Poster
1. A large-scale Japanese financial benchmark that requires expert reasoning. Overall, this looks like solid, well-delivered work.
2. Covers diverse, realistic financial tasks. Grounding the tasks in real-world financial scenarios is a strength.
3. Includes a reproducible toolkit and dataset, which have other applications as well.
1. LLMs show limited performance gains over traditional models. SLMs could also be explored in depth for these tasks, and more sophisticated approaches, such as an MCP-based route, could be considered.
2. Part of the evaluation is limited to zero-shot, with no fine-tuning comparisons. Zero-shot evaluation can be a reasonable choice in some cases but cannot be relied on all the time.
Clear gap & relevance. Financial NLP benchmarks often emphasize QA/extraction (e.g., FinQA/ConvFinQA/FinanceBench; largely English), whereas this work targets end-to-end expert reasoning over full reports in Japanese, a meaningful and under-served setting.
Nontrivial tasks and inputs. Whole-report inputs (tables + text) and tasks like fraud detection are practically impactful and distinct from prior QA settings and multimodal finance QA (e.g., FAMMA).
Open tooling & dataset construction detai
Fraud labels rely on LLM-assisted screening of amended reports. Although the labels are later checked manually, the pipeline first classifies “amended report reasons” with Claude and then claims fewer than 5% label errors. This creates potential circularity (an LLM both builds the dataset and is evaluated on the same domain) and possible systematic biases in what counts as fraud (e.g., wording styles of amendments). A larger human validation with inter-rater agreement would strengthen validity.
Definition of “fraud” is ambiguous. A
- Real-world, long-context financial documents (tables + text) with open data, code, and tooling for reproducibility.
- Clear task definitions and splits; systematic evaluation across models and input modalities, plus contamination checks.
- Practical insights (e.g., text helps fraud detection but not earnings forecasting) that inform future benchmark and agentic framework design.
- The benchmark is limited to the Japanese financial market. But I believe the methodology of the benchmark is applicable to other financial markets. Results may be different and interesting in other markets (I guess the industry prediction task results may vary).
- The industry prediction task, while straightforward to evaluate, does not closely reflect real-world use cases. In practice, industry labels are readily available from official sources, so predicting them from financial statements h
Taxonomy
Topics: Financial Distress and Bankruptcy Prediction · Stock Market Forecasting Methods · Auditing, Earnings Management, Governance
Methods: Logistic Regression
