Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations

Maya Patel; Aditi Anand

arXiv:2412.18051·cs.CL·December 25, 2024

TL;DR

This paper benchmarks modern large language models on ambiguous question answering tasks with citations, revealing their strengths and weaknesses in handling multiple answers and citation accuracy, and proposes prompt-based improvements.

Contribution

It introduces a comprehensive benchmark for evaluating LLMs on ambiguous QA with citations and demonstrates how conflict-aware prompting enhances their performance.

Findings

01

Larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers.

02

All models perform equally poorly in citation generation, with citation accuracy consistently at 0.

03

Conflict-aware prompting improves answer and citation performance.
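
The findings above distinguish three things: getting at least one valid answer, covering all valid answers, and citing the right source for each answer. The sketch below illustrates how those three quantities could be scored for a single question. It is not the paper's evaluation code: the function names, the exact-match normalization, and the answer-to-source-id mapping are all assumptions made for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so string comparison is lenient."""
    return " ".join(text.lower().split())


def score_answers(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Compare predicted answers with the set of valid gold answers.

    Hypothetical metric names: "at_least_one" is 1.0 if any gold answer
    was predicted; "recall" is the fraction of gold answers recovered.
    """
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    hits = pred & ref
    return {
        "at_least_one": 1.0 if hits else 0.0,
        "recall": len(hits) / len(ref) if ref else 0.0,
    }


def citation_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold answers whose predicted source id matches the gold one.

    Both arguments map an answer string to a source id (an assumed format).
    """
    if not gold:
        return 0.0
    pred = {normalize(answer): source for answer, source in predicted.items()}
    correct = sum(
        1 for answer, source in gold.items()
        if pred.get(normalize(answer)) == source
    )
    return correct / len(gold)


# Made-up example: one ambiguous question with two valid answers.
gold_answers = {"1969": "doc1", "1972": "doc2"}
model_output = {"1969": "doc2"}  # one right answer, wrong source

answer_scores = score_answers(list(model_output), list(gold_answers))
cite_score = citation_accuracy(model_output, gold_answers)
```

In this made-up example the model finds one of the two valid answers, so `at_least_one` is 1.0 while `recall` is 0.5, and the citation score is 0.0 because the one answer it gives is attributed to the wrong source. This mirrors the pattern the findings describe, under the stated assumptions.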

Abstract

Benchmarking modern large language models (LLMs) on complex and realistic tasks is critical to advancing their development. In this work, we evaluate the factual accuracy and citation performance of state-of-the-art LLMs on the task of Question Answering (QA) in ambiguous settings with source citations. Using three recently published datasets (DisentQA-DupliCite, DisentQA-ParaCite, and AmbigQA-Cite) featuring a range of real-world ambiguities, we analyze the performance of two leading LLMs, GPT-4o-mini and Claude-3.5. Our results show that larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers. Additionally, all models perform equally poorly in citation generation, with citation accuracy consistently at 0. However, introducing conflict-aware prompting leads to large improvements, enabling models to…

Taxonomy

Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations