FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models
Andrew Zhu, Alyssa Hwang, Liam Dugan, Chris Callison-Burch

TL;DR
FanOutQA introduces a new dataset and benchmark for evaluating large language models on complex multi-hop, multi-document reasoning questions, revealing current models' limitations in inter-document reasoning.
Contribution
The paper presents FanOutQA, a high-quality dataset and benchmark for multi-hop, multi-document reasoning, along with human-annotated decompositions, to evaluate and improve LLMs.
Findings
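Contemporary LLMs still struggle to reason over inter-document dependencies in a long context.
Across three benchmark settings, the 7 benchmarked LLMs (including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B) leave room for improvement on multi-hop question answering.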
The dataset and tools are publicly available for further research.
Abstract
One type of question that is commonly found in day-to-day scenarios is "fan-out" questions, complex multi-hop, multi-document reasoning questions that require finding information about a large number of entities. However, there exist few resources to evaluate this type of question-answering capability among large language models. To evaluate complex reasoning in LLMs more fully, we present FanOutQA, a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions with English Wikipedia as the knowledge base. We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B, finding that contemporary models still have room to improve reasoning over inter-document dependencies in a long context. We provide our dataset and open-source tools to run models to encourage evaluation at…
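To make the idea of a fan-out question and its decomposition concrete, the following is a minimal sketch. It is not the FanOutQA data format or tooling: the class names (`SubQuestion`, `FanOutQuestion`), the helper `answer_fan_out`, and the toy documents and values are all hypothetical and invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    """One hop: a question about a single entity, answered from one document."""
    entity: str
    question: str
    document: str  # hypothetical title standing in for a knowledge-base page


@dataclass
class FanOutQuestion:
    """A top-level question that fans out into one sub-question per entity."""
    question: str
    sub_questions: List[SubQuestion] = field(default_factory=list)


def answer_fan_out(
    q: FanOutQuestion,
    lookup: Callable[[SubQuestion], int],
    aggregate: Callable[[Dict[str, int]], int],
) -> int:
    """Answer each sub-question independently, then combine the partial answers."""
    partial = {sq.entity: lookup(sq) for sq in q.sub_questions}
    return aggregate(partial)


# Toy knowledge base with made-up values (NOT taken from the dataset).
TOY_DOCS = {"Doc A": 3, "Doc B": 5, "Doc C": 4}

toy = FanOutQuestion(
    question="What is the total of the values for entities A, B, and C?",
    sub_questions=[
        SubQuestion(entity=e, question=f"What is the value for {e}?", document=f"Doc {e}")
        for e in "ABC"
    ],
)

# Each hop reads a different document; the final step reasons across all of them.
total = answer_fan_out(
    toy,
    lookup=lambda sq: TOY_DOCS[sq.document],
    aggregate=lambda partial: sum(partial.values()),
)
print(total)  # 12
```

The sketch separates the per-entity lookups from the final aggregation step, since the abstract locates the difficulty in combining information across documents rather than in any single lookup.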
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection
