Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
Florian A. D. Burnat, Brittany I. Davidson

TL;DR
This paper introduces a paired-prompt protocol for measuring how evaluation framing changes the behavior of open-weight LLMs. It finds striking heterogeneity across models, and the measured effects depend on the judge.
Contribution
It presents a method for quantifying evaluation-context divergence in open-weight LLMs and shows that the effect of framing differs across models and across judges.
Findings
For OLMo-3-Instruct, evaluation framing increases refusals and reduces harmful compliance.
Other models, such as Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B, are deployment-cautious.
The cross-family heterogeneity is judge-dependent, which affects how the results should be interpreted.
Abstract
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ( paired items, generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by pp () and reduces…
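To make the within-item comparison concrete, here is a minimal sketch of how a paired framing effect on refusal rate could be computed in percentage points, with a percentile bootstrap over items. It is an illustration under stated assumptions, not the authors' analysis code: the data layout, the function names `paired_divergence_pp` and `bootstrap_ci`, the choice of bootstrap, and the toy labels are all hypothetical, and the refusal labels are assumed to come from a judge that has already scored each generation as 0 or 1.

```python
import random

def rate(labels):
    """Fraction of generations labeled as refusals (labels are 0 or 1)."""
    return sum(labels) / len(labels)

def paired_divergence_pp(refused, framing_a="evaluation", framing_b="neutral"):
    """Mean within-item difference in refusal rate, in percentage points.

    refused maps item_id -> framing -> list of 0/1 refusal labels.
    Each item contributes one difference, so item difficulty cancels out.
    """
    diffs = [
        rate(by_framing[framing_a]) - rate(by_framing[framing_b])
        for by_framing in refused.values()
    ]
    return 100.0 * sum(diffs) / len(diffs), diffs

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (in pp), resampling items with replacement."""
    rng = random.Random(seed)
    n = len(diffs)
    stats = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data, invented for illustration only (not results from the paper).
toy = {
    "item-1": {"evaluation": [1, 1], "deployment": [0, 1], "neutral": [0, 0]},
    "item-2": {"evaluation": [1, 0], "deployment": [0, 0], "neutral": [1, 0]},
    "item-3": {"evaluation": [0, 0], "deployment": [0, 0], "neutral": [0, 0]},
}

effect_pp, diffs = paired_divergence_pp(toy)
ci_lo, ci_hi = bootstrap_ci(diffs)
```

In this sketch a positive `effect_pp` would correspond to the eval-cautious pattern (more refusals under evaluation framing than under neutral framing). Passing `framing_a="deployment"` gives the analogous deployment-versus-neutral contrast. Because the labels come from a judge, rerunning the same computation with a different judge's labels is one way to check whether the estimated effect is judge-dependent.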
