Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

Rahul Baxi

arXiv:2512.17920 · cs.CL · December 23, 2025

Open Access

TL;DR

This paper introduces the CDCT benchmark to separately evaluate constraint compliance and semantic accuracy in LLMs under prompt compression, revealing a universal U-curve pattern and the impact of RLHF on constraint violations.

Contribution

The paper presents the novel CDCT benchmark and uncovers the orthogonal relationship between constraint compliance and semantic accuracy, along with insights into RLHF's role in constraint violations.

Findings

01 Universal U-curve pattern in constraint compliance across compression levels

02 RLHF removal significantly improves constraint compliance

03 Reasoning models outperform efficient models in instruction-following

Abstract

Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' κ=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic…
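For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of how Fleiss' κ can be computed for a fixed-size jury. It uses the standard textbook formula and is not taken from the paper's code. The function name `fleiss_kappa`, the "pass"/"fail" labels, and the toy ratings are all hypothetical; the paper's actual rating scale is not stated in this listing.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of judges.

    `ratings` is a list of per-item label lists, e.g. [["pass", "pass", "fail"], ...].
    The label names are illustrative assumptions, not the paper's scale.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Count how many judges chose each category, per item.
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: per-item proportion of agreeing judge pairs, averaged.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_e = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined in that case.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (hypothetical): three judges, four responses.
toy = [
    ["pass", "pass", "pass"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
]
kappa = fleiss_kappa(toy)
```

Values near 1 indicate that the judges agree far more often than chance would predict; by the commonly used Landis and Koch convention, values above 0.80 are labelled "almost perfect", which matches the wording in the abstract.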

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)