WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

TL;DR
This paper introduces WikiContradict, a benchmark dataset for evaluating how well large language models handle real-world knowledge conflicts from Wikipedia, revealing models' limitations in reasoning about contradictory information.
Contribution
The paper presents WikiContradict, a new benchmark dataset with human-annotated instances to assess LLM performance on conflicting knowledge, along with an automated evaluation method.
Findings
LLMs struggle to accurately reflect conflicting facts from Wikipedia.
Models have difficulty with implicit conflicts requiring reasoning.
Automated evaluation achieves an F-score of 0.8, enabling scalable assessment.
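The F-score reported above summarizes how closely the automated judge agrees with human annotation. The summary does not say which F-score variant or label set was used, so the following is only a minimal sketch, assuming a binary labeling (1 = answer judged correct, 0 = otherwise) and the balanced F1 variant. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[int], gold: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of the positive class (label 1).

    `predicted` holds the automated judge's labels and `gold` the human
    labels. Both are hypothetical binary lists of equal length.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Toy example (made-up labels, not data from the benchmark).
judge_labels = [1, 1, 0, 1, 0]
human_labels = [1, 0, 0, 1, 1]
p, r = precision_recall(judge_labels, human_labels)
print(f"precision={p:.3f} recall={r:.3f} F1={f_score(p, r):.3f}")
```

Under these assumptions, an F-score of 0.8 would mean the automated judge's precision and recall against human labels average out, harmonically, to 0.8.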
Abstract
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range…
Taxonomy
Topics: Wikis in Education and Collaboration · Digital Rights Management and Security
Methods: Attention Is All You Need · WordPiece · Residual Connection · Weight Decay · Softmax · Layer Normalization · Byte Pair Encoding · Attention Dropout · Linear Warmup With Linear Decay
