FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

Junyu Luo; Zhizhuo Kou; Liming Yang; Xiao Luo; Jinsheng Huang; Zhiping Xiao; Jingshu Peng; Chengzhong Liu; Jiaming Ji; Xuanzhe Liu; Sirui Han; Ming Zhang; Yike Guo

arXiv:2505.24714·cs.CL·June 2, 2025

TL;DR

FinMME is a benchmark dataset of more than 11,000 financial research samples for evaluating multimodal large language models (MLLMs) in finance. It addresses the lack of specialized multimodal evaluation datasets in this domain and shows that current models, including GPT-4o, still perform unsatisfactorily.

Contribution

The paper introduces FinMME, a large-scale, high-quality financial multimodal dataset, and FinScore, an evaluation system that incorporates hallucination penalties and multi-dimensional capability assessment, to advance and standardize MLLM development in finance.

Findings

01 Even state-of-the-art models such as GPT-4o perform unsatisfactorily on FinMME.
02 The dataset is highly robust, with prediction variations below 1%.
03 FinMME reveals the current limitations of MLLMs in financial reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The…

Code & Models

Repositories

luo-junyu/finmme (official)

Datasets

luojunyu/FinMME (dataset, 367 downloads)

Taxonomy

Topics: Stock Market Forecasting Methods