Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering

Eviatar Nachshoni, Arie Cattan, Shmuel Amar, Ori Shapira, and Ido Dagan

arXiv:2508.12355·cs.CL·August 19, 2025

TL;DR

This paper introduces NATCONFQA, a new benchmark for conflict-aware multi-answer question answering, highlighting the challenges LLMs face in identifying and resolving conflicting answers in realistic scenarios.

Contribution

The paper presents a novel, cost-effective methodology to create a realistic benchmark with conflict labels and evaluates LLMs' performance on it, revealing their limitations.

Findings

01. LLMs struggle with conflict detection in MAQA.

02. Existing models often fail to identify conflicting answer pairs.

03. NATCONFQA provides a more realistic evaluation of LLMs in conflict scenarios.

Abstract

Large Language Models (LLMs) have demonstrated strong performance in question answering (QA) tasks. However, Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. Traditional QA settings often assume consistency across evidences, but MAQA can involve conflicting answers. Constructing datasets that reflect such conflicts is costly and labor-intensive, while existing benchmarks often rely on synthetic data, restrict the task to yes/no questions, or apply unverified automated annotation. To advance research in this area, we extend the conflict-aware MAQA setting to require models not only to identify all valid answers, but also to detect specific conflicting answer pairs, if any. To support this task, we introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA, a new benchmark for…
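To make the task format in the abstract concrete, here is a minimal sketch of what a conflict-aware MAQA output could look like and how it might be scored: a set of valid answers plus a set of unordered conflicting answer pairs, each compared to a gold reference with set-level F1. The class name `MAQAOutput`, the helpers `pair`, `set_f1`, and `score`, the example answers, and the choice of F1 are all assumptions made for illustration; the abstract does not specify the benchmark's schema or metrics.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MAQAOutput:
    """Hypothetical container for one question (not the benchmark's schema)."""
    answers: set[str]
    # Each conflict is an unordered pair of answers that contradict each other.
    conflicts: set[frozenset[str]] = field(default_factory=set)


def pair(a: str, b: str) -> frozenset[str]:
    """Build an unordered answer pair, so (a, b) and (b, a) compare equal."""
    return frozenset((a, b))


def set_f1(pred: set, gold: set) -> float:
    """Set-level F1; two empty sets count as a perfect match (an assumption)."""
    if not pred and not gold:
        return 1.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(pred: MAQAOutput, gold: MAQAOutput) -> dict[str, float]:
    """Score answer identification and conflict-pair detection separately."""
    return {
        "answer_f1": set_f1(pred.answers, gold.answers),
        "conflict_f1": set_f1(pred.conflicts, gold.conflicts),
    }


# Illustrative data only: the model finds the conflicting pair but misses
# one of the three valid answers.
gold = MAQAOutput(
    answers={"yes", "no", "only in adults"},
    conflicts={pair("yes", "no")},
)
pred = MAQAOutput(
    answers={"yes", "no"},
    conflicts={pair("no", "yes")},
)
result = score(pred, gold)
```

Scoring the two sets separately keeps a model that lists every answer but flags no conflicts distinguishable from one that does both, which is the distinction the extended task is meant to expose.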

Taxonomy

Topics: Expert finding and Q&A systems · Speech and dialogue systems · Topic Modeling