TL;DR
MEBench is a comprehensive benchmark designed to evaluate large language models' ability to perform multi-entity question answering across multiple documents, revealing current limitations and guiding future improvements.
Contribution
We introduce MEBench, a new multi-document, multi-entity benchmark with 4,780 questions to systematically assess LLMs' multi-entity reasoning and information consolidation capabilities.
Findings
State-of-the-art LLMs achieve only 59% accuracy on MEBench.
Current models struggle with cross-document aggregation and entity attribution.
MEBench highlights key weaknesses in existing LLM-based QA systems.
Abstract
Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often falter at cross-document aggregation, particularly when resolving entity-dense questions such as "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions, which are systematically categorized into three primary…
Taxonomy
Methods: Weight Decay · Absolute Position Encodings · Attention Dropout · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing
