WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Yufang Hou; Alessandra Pascale; Javier Carnerero-Cano; Tigran Tchrakian; Radu Marinescu; Elizabeth Daly; Inkit Padhi; Prasanna Sattigeri

arXiv:2406.13805 · cs.CL · June 21, 2024 · 1 citation

TL;DR

This paper introduces WikiContradict, a benchmark dataset for evaluating how well large language models handle real-world knowledge conflicts from Wikipedia, revealing models' limitations in reasoning about contradictory information.

Contribution

The paper presents WikiContradict, a new benchmark dataset with human-annotated instances to assess LLM performance on conflicting knowledge, along with an automated evaluation method.

Findings

01 LLMs struggle to accurately reflect conflicting facts from Wikipedia.

02 Models have difficulty with implicit conflicts requiring reasoning.

03 Automated evaluation achieves an F-score of 0.8, enabling scalable assessment.
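
To make the F-score in the third finding concrete, the sketch below computes an F-score for an automated judge against human labels. The label names, the toy lists, and the `f_score` helper are all hypothetical illustrations; they are not the paper's data or its evaluation code.

```python
def f_score(gold, pred, positive="correct", beta=1.0):
    """F-beta score for one positive class, from two parallel label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    # Count agreement and disagreement on the positive class.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical labels: human annotations vs. an automated judge's verdicts.
human = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
judge = ["correct", "incorrect", "incorrect", "correct", "correct", "correct"]

score = f_score(human, judge)
print(f"F1 = {score:.2f}")
```

In this toy case the judge agrees with the human annotators on three of the four "correct" answers and raises one false alarm, so precision and recall are both 0.75 and the F1 is 0.75.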

Abstract

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range…

Code & Models

Datasets

ibm-research/Wikipedia_contradict_benchmark
dataset · 224 dl

Videos

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia · SlidesLive

Taxonomy

Topics: Wikis in Education and Collaboration · Digital Rights Management and Security

Methods: Attention Is All You Need · WordPiece · Residual Connection · Weight Decay · Softmax · Layer Normalization · Byte Pair Encoding · Attention Dropout · Linear Warmup With Linear Decay