DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness
Jiabao Ji, Min Li, Priyanshu Kumar, Shiyu Chang, Saloni Potdar

TL;DR
DeepAmbigQA introduces a challenging benchmark dataset with ambiguous multi-hop questions to evaluate and improve the answer completeness of large language models, revealing current models' limitations in handling ambiguity and complex reasoning.
Contribution
The paper presents DeepAmbigQA, a novel dataset and data generation pipeline for evaluating LLMs on ambiguous, multi-hop questions requiring complex reasoning.
Findings
GPT-5 achieves only 0.13 exact match on ambiguous questions
Models struggle with answer completeness in complex, ambiguous questions
DeepAmbigQA exposes gaps in current LLM reasoning capabilities
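The findings above report exact match and describe answer completeness, but this summary does not spell out how either is computed. The sketch below is one plausible reading, not the paper's definition: it assumes exact match means the predicted answer set equals the gold answer set after light normalization, and it uses set-level recall as a stand-in for completeness. The names `normalize` and `answer_set_scores` are hypothetical, and the answers in the usage note are placeholders.

```python
from typing import Dict, Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def answer_set_scores(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Compare a predicted answer set against a gold answer set.

    Assumption: exact match is 1.0 only when both normalized sets are identical.
    Recall measures how much of the gold set was recovered (completeness);
    precision measures how much of the prediction is correct.
    """
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    overlap = len(pred & ref)

    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "exact_match": float(pred == ref),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, `answer_set_scores(["Actor A", "actor b"], ["Actor A", "Actor B", "Actor C"])` yields perfect precision but incomplete recall, so exact match is 0.0. A model can therefore be correct on every answer it gives and still fail a set-level exact-match criterion.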
Abstract
Large language models (LLMs) with integrated search tools show strong promise in open-domain question answering (QA), yet they often struggle to produce a complete answer set for complex questions such as "Which actor from the film Heat won at least one Academy Award?", which requires (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and a linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi-step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop reasoning and half of them explicit name…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
