ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models
Jierui Li, Vipul Raheja, Dhruv Kumar

TL;DR
This paper introduces ContraDoc, a new dataset for studying self-contradictions in long documents, and evaluates the capabilities of leading large language models, revealing that even the best models are still unreliable on nuanced contradictions.
Contribution
The paper presents the first human-annotated dataset for self-contradictions in long documents and analyzes the performance of four state-of-the-art LLMs (GPT3.5, GPT4, PaLM2, and LLaMAv2) on this challenging task.
Findings
GPT4 performs best among the four evaluated models and can outperform humans on this task.
All models struggle with nuanced and context-dependent contradictions.
Models are unreliable in detecting complex self-contradictions.
Abstract
In recent times, large language models (LLMs) have shown impressive performance on various document-level tasks such as document classification, summarization, and question-answering. However, research on understanding their capabilities on the task of detecting self-contradictions in long documents has been very limited. In this work, we introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains, varying document lengths, self-contradiction types, and scope. We then analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context. We release the dataset and all the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
