DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness

Jiabao Ji; Min Li; Priyanshu Kumar; Shiyu Chang; Saloni Potdar

arXiv:2511.01323·cs.CL·November 4, 2025

Open Access

TL;DR

DeepAmbigQA introduces a challenging benchmark dataset with ambiguous multi-hop questions to evaluate and improve the answer completeness of large language models, revealing current models' limitations in handling ambiguity and complex reasoning.

Contribution

The paper presents DeepAmbigQA, a novel dataset and data generation pipeline for evaluating LLMs on ambiguous, multi-hop questions requiring complex reasoning.

Findings

01. GPT-5 achieves only 0.13 exact match on ambiguous questions

02. Models struggle with answer completeness in complex, ambiguous questions

03. DeepAmbigQA exposes gaps in current LLM reasoning capabilities
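
To make the exact-match and completeness findings concrete, here is a minimal sketch of how a predicted answer set might be scored against a gold answer set. It assumes answers are compared as sets of normalized strings. The `normalize` and `answer_set_scores` helpers, the normalization rule, and the placeholder answers are all hypothetical illustrations, not the paper's evaluation code or metric definitions.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(answer.lower().split())


def answer_set_scores(predicted, gold):
    """Score a predicted answer set against a gold answer set.

    Exact match is 1.0 only when the two normalized sets are identical,
    so a single missing or extra answer drops it to 0.0. Precision and
    recall give partial credit for incomplete answer sets.
    """
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {
        "exact_match": float(pred == ref),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical placeholder answers, not taken from the dataset.
gold = ["Actor A", "Actor B", "Actor C"]
predicted = ["actor a", "Actor B"]
scores = answer_set_scores(predicted, gold)
```

In this sketch every predicted answer is correct, yet exact match is zero because one gold answer is missing. This is the kind of incompleteness that a set-level exact-match metric penalizes.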

Abstract

Large language models (LLMs) with integrated search tools show strong promise in open-domain question answering (QA), yet they often struggle to produce a complete answer set for complex questions such as "Which actor from the film Heat won at least one Academy Award?", which requires (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and a linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi-step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop reasoning and half of them explicit name…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques