TL;DR
MAGIC introduces a graph-based benchmark to evaluate how well large language models detect and handle complex, multi-hop knowledge conflicts across diverse contexts, revealing current limitations in conflict detection and source identification.
Contribution
The paper presents MAGIC, a novel knowledge graph (KG)-based benchmark for multi-hop, inter-context conflicts in retrieval-augmented generation (RAG). It addresses limitations of prior benchmarks and provides insights into LLMs' conflict detection abilities.
Findings
Models struggle with conflict detection, especially in multi-hop scenarios.
Models often fail to identify the exact source of contradictions.
The benchmark reveals significant gaps in current LLM capabilities.
Abstract
Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection -- especially when…
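The idea of deriving two similar yet conflicting contexts from a knowledge graph can be illustrated with a toy sketch. This is not the MAGIC pipeline: the triple representation, the function names (`perturb_triple`, `follow_path`, `differing_triples`), and the example entities are all hypothetical, chosen only to show how changing a single edge can produce a conflict that surfaces after two hops.

```python
from typing import List, Optional, Tuple

# Assumption: a context is a list of (head, relation, tail) triples.
Triple = Tuple[str, str, str]


def perturb_triple(triples: List[Triple], index: int, new_tail: str) -> List[Triple]:
    """Return a copy of `triples` in which one triple has a different tail."""
    head, relation, tail = triples[index]
    if new_tail == tail:
        raise ValueError("new_tail must differ from the original tail")
    perturbed = list(triples)
    perturbed[index] = (head, relation, new_tail)
    return perturbed


def follow_path(triples: List[Triple], start: str, relations: List[str]) -> Optional[str]:
    """Follow a chain of relations from `start`; None if a hop is missing or ambiguous."""
    node = start
    for relation in relations:
        tails = [t for (h, r, t) in triples if h == node and r == relation]
        if len(tails) != 1:
            return None
        node = tails[0]
    return node


def differing_triples(a: List[Triple], b: List[Triple]) -> Tuple[List[Triple], List[Triple]]:
    """Return the triples found only in `a` and only in `b` (the conflict source)."""
    return [t for t in a if t not in b], [t for t in b if t not in a]


# Hypothetical example data, not taken from the benchmark.
context_a = [
    ("Marie", "works_at", "LabX"),
    ("LabX", "located_in", "Lyon"),
    ("LabY", "located_in", "Nice"),
]
# Change only the first hop; the second-hop facts stay identical.
context_b = perturb_triple(context_a, 0, "LabY")

query = ["works_at", "located_in"]  # "In which city does Marie work?"
print(follow_path(context_a, "Marie", query))  # Lyon
print(follow_path(context_b, "Marie", query))  # Nice
print(differing_triples(context_a, context_b))
```

In this sketch the two contexts disagree on the city only implicitly: no single `located_in` triple differs, so the contradiction becomes visible only by chaining two hops, while `differing_triples` recovers the one edge that causes it. That separation loosely mirrors the distinction the summary draws between detecting a conflict and identifying its exact source.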