Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations
Maya Patel, Aditi Anand

TL;DR
This paper benchmarks modern large language models on ambiguous question answering tasks with citations, revealing their strengths and weaknesses in handling multiple answers and citation accuracy, and proposes prompt-based improvements.
Contribution
It introduces a comprehensive benchmark for evaluating LLMs on ambiguous QA with citations and demonstrates how conflict-aware prompting enhances their performance.
Findings
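Larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers.
All models perform equally poorly in citation generation, with citation accuracy consistently at 0.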
Conflict-aware prompting improves answer and citation performance.
Abstract
Benchmarking modern large language models (LLMs) on complex and realistic tasks is critical to advancing their development. In this work, we evaluate the factual accuracy and citation performance of state-of-the-art LLMs on the task of Question Answering (QA) in ambiguous settings with source citations. Using three recently published datasets (DisentQA-DupliCite, DisentQA-ParaCite, and AmbigQA-Cite) featuring a range of real-world ambiguities, we analyze the performance of two leading LLMs, GPT-4o-mini and Claude-3.5. Our results show that larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers. Additionally, all models perform equally poorly in citation generation, with citation accuracy consistently at 0. However, introducing conflict-aware prompting leads to large improvements, enabling models to…
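The distinction the abstract draws between predicting at least one correct answer, covering all valid answers, and citing correctly can be illustrated with a small scoring sketch. This is not the paper's evaluation code: the function name `score_example`, the field names, the exact-match normalization, and the set-equality rule for citations are all assumptions made for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic normalization)."""
    return " ".join(text.lower().split())


def score_example(predicted, gold, cited, gold_sources):
    """Score one ambiguous question (hypothetical metric definitions).

    predicted:    answers produced by the model
    gold:         all valid answers for the question
    cited:        source ids the model cited
    gold_sources: source ids that actually support the valid answers
    """
    pred = {normalize(a) for a in predicted}
    gold_set = {normalize(a) for a in gold}
    return {
        # At least one valid answer was produced.
        "any_correct": bool(pred & gold_set),
        # Every valid answer was produced (the multi-answer case).
        "all_correct": gold_set <= pred,
        # Cited sources exactly match the supporting sources (assumed rule).
        "citation_correct": set(cited) == set(gold_sources),
    }
```

Under these assumed definitions, a model that returns only one of two valid answers and cites nothing scores `any_correct` but fails both `all_correct` and `citation_correct`, which mirrors the failure pattern the abstract describes.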
Taxonomy
Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations
