MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering

Teng Lin; Yuyu Luo; Honglin Zhang; Jicheng Zhang; Chunlin Liu; Kaishun Wu; Nan Tang

arXiv:2502.18993 · cs.CL · September 25, 2025

TL;DR

MEBench is a comprehensive benchmark designed to evaluate large language models' ability to perform multi-entity question answering across multiple documents, revealing current limitations and guiding future improvements.

Contribution

We introduce MEBench, a new multi-document, multi-entity benchmark with 4,780 questions to systematically assess LLMs' multi-entity reasoning and information consolidation capabilities.

Findings

01. State-of-the-art LLMs achieve only 59% accuracy on MEBench.

02. Current models struggle with cross-document aggregation and entity attribution.

03. MEBench highlights key weaknesses in existing LLM-based QA systems.

Abstract

Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often falter at cross-document aggregation, particularly when resolving entity-dense questions such as "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions, which are systematically categorized into three primary…
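To make the aggregation step concrete, the following is a minimal sketch of the kind of consolidation the ACM Fellows example question calls for, assuming one attribute record has already been extracted per entity page. The records, the field names, and the `field_distribution` helper are hypothetical illustrations, not part of MEBench or the authors' pipeline.

```python
from collections import Counter

# Hypothetical per-document extractions: one record per entity page.
# Entity names and fields are invented for illustration only.
records = [
    {"entity": "Fellow A", "field": "Databases"},
    {"entity": "Fellow B", "field": "Machine Learning"},
    {"entity": "Fellow C", "field": "Databases"},
    {"entity": "Fellow D", "field": None},  # attribute missing from the page
]


def field_distribution(records):
    """Return the share of entities per field, skipping records with no field."""
    counts = Counter(r["field"] for r in records if r["field"] is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {field: n / total for field, n in counts.items()}


distribution = field_distribution(records)
print(distribution)
```

Even in this toy form, the answer depends on every entity's record being retrieved and attributed correctly; a single missed or misattributed page shifts the whole distribution.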

Taxonomy

Methods: Weight Decay · Absolute Position Encodings · Attention Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing