MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks

Lei Zhang; Xin Zhou; Chaoyue He; Di Wang; Yi Wu; Hong Xu; Wei Liu; Chunyan Miao

arXiv:2507.18932 · cs.MM · August 18, 2025

TL;DR

MMESGBench is a pioneering benchmark dataset designed to evaluate multimodal understanding and complex reasoning in ESG documents, addressing the challenges posed by their diverse structure and multimodal content.

Contribution

It introduces the first comprehensive ESG benchmark with multimodal QA pairs, created through a human-AI collaborative pipeline, enabling better evaluation of AI models in this domain.

Findings

01

Multimodal models outperform text-only baselines on the benchmark.

02

The advantage of multimodal models is most pronounced on visually grounded and cross-page questions.

03

The dataset covers diverse ESG document types and source categories.

Abstract

Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in the ESG domain. To fill this gap, we introduce MMESGBench, a first-of-its-kind benchmark dataset designed to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. The dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting…
