When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering
Yikun Han, Mengfei Lan, Halil Kilicoglu

TL;DR
This paper evaluates six open-weight LLMs for retrieval-augmented biomedical question answering under conflicting evidence, shows that accuracy drops when the order of the conflicting documents is reversed, and evaluates a conflict-aware abstention score intended to improve reliability.
Contribution
It introduces a systematic evaluation of LLMs with conflicting biomedical evidence and proposes an abstention score that enhances decision reliability.
Findings
Accuracy drops for every model when the order of the two conflicting documents is reversed, and 11.4% to 25.2% of predictions flip.
A conflict-aware abstention score improves selective accuracy.
Conflicting evidence impacts both uncertainty and robustness.
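The order-effect finding can be made concrete with a small sketch. The code below computes accuracy under each document order and the fraction of predictions that flip between the two orders. The function names and the toy labels are illustrative assumptions, not data or code from the paper.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answers."""
    if len(preds) != len(gold):
        raise ValueError("preds and gold must have the same length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def flip_rate(preds_a: Sequence[str], preds_b: Sequence[str]) -> float:
    """Fraction of questions whose prediction differs between two runs.

    Here the two runs are assumed to use the same two documents,
    presented in opposite orders.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both runs must cover the same questions")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy data (hypothetical, for illustration only).
gold = ["yes", "no", "yes", "no"]
preds_correct_first = ["yes", "no", "yes", "yes"]
preds_incorrect_first = ["yes", "yes", "no", "yes"]

acc_a = accuracy(preds_correct_first, gold)
acc_b = accuracy(preds_incorrect_first, gold)
flips = flip_rate(preds_correct_first, preds_incorrect_first)
print(f"accuracy drop: {acc_a - acc_b:.2f}, flip rate: {flips:.2f}")
```

Reporting the flip rate alongside the accuracy drop is useful because some flips can go from wrong to right, so the accuracy difference alone can understate how sensitive a model is to document order.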
Abstract
Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4% to 25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest…
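The abstract says the abstention score combines model confidence with a detector of evidence conflict, but the excerpt does not give the formula. The sketch below shows one hypothetical way such a combination could work: a weighted average of confidence and the detector's "no conflict" probability, with abstention below a threshold. The `Prediction` fields, the `weight` parameter, the threshold, and the toy values are all assumptions made for illustration and should not be read as the paper's method.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    answer: str
    gold: str
    confidence: float     # model confidence in [0, 1] (assumed available)
    conflict_prob: float  # hypothetical conflict-detector output in [0, 1]


def keep_score(p: Prediction, weight: float = 0.5) -> float:
    """Hypothetical combined score: high when the model is confident
    and the detector sees little conflict in the evidence."""
    return (1.0 - weight) * p.confidence + weight * (1.0 - p.conflict_prob)


def selective_accuracy(
    preds: List[Prediction], threshold: float, weight: float = 0.5
) -> Tuple[Optional[float], float]:
    """Return (accuracy on answered questions, coverage).

    The model abstains whenever keep_score falls below the threshold.
    Accuracy is None if every question is abstained on.
    """
    kept = [p for p in preds if keep_score(p, weight) >= threshold]
    coverage = len(kept) / len(preds)
    if not kept:
        return None, coverage
    acc = sum(p.answer == p.gold for p in kept) / len(kept)
    return acc, coverage


# Toy predictions (hypothetical values, for illustration only).
toy = [
    Prediction("yes", "yes", confidence=0.90, conflict_prob=0.1),
    Prediction("no", "yes", confidence=0.80, conflict_prob=0.9),
    Prediction("yes", "yes", confidence=0.60, conflict_prob=0.2),
    Prediction("no", "no", confidence=0.55, conflict_prob=0.8),
]

print(selective_accuracy(toy, threshold=0.0))  # answer everything
print(selective_accuracy(toy, threshold=0.6))  # abstain on low scores
```

In this toy setting, the second prediction is confident but wrong under heavy conflict, so a confidence-only rule would keep it while the combined score drops it. That is the kind of case a conflict-aware score is meant to catch, at the cost of lower coverage.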
