ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

Xingwei He; Qianru Zhang; Pengfei Chen; Guanhua Chen; Linlin Yu; Yuan Yuan; Siu-Ming Yiu

arXiv:2511.14342·cs.CL·November 20, 2025

TL;DR

This paper introduces ConInstruct, a benchmark for evaluating large language models' ability to detect and resolve conflicting instructions, revealing strengths in conflict detection but weaknesses in user notification and clarification.

Contribution

The paper presents a new benchmark, ConInstruct, to assess LLMs' conflict detection and resolution, filling a gap in evaluating their behavior with conflicting instructions.

Findings

01. Most proprietary LLMs detect conflicts well.

02. Open-source models vary in conflict detection performance.

03. LLMs rarely notify users or seek clarification about conflicts.

Abstract

Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the…

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Software Engineering Research