Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

Florian A. D. Burnat; Brittany I. Davidson

arXiv:2605.06327·cs.CL·May 8, 2026

TL;DR

This paper introduces a paired-prompt protocol that measures how evaluation framing changes the behavior of open-weight LLMs, and its pilot results show marked heterogeneity across models along with judge-dependent effects.

Contribution

The paper presents a method for quantifying evaluation-context divergence in open-weight LLMs and shows that the response to framing differs across models and across judges.

Findings

01

In OLMo-3-Instruct, evaluation framing increases refusals and reduces harmful compliance.

02

Other models, such as Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B, are deployment-cautious.

03

Cross-family heterogeneity is judge-dependent, which affects how the results should be interpreted.

Abstract

Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ($20$ paired items, $840$ generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by $11.8$ pp ($p = 0.007$) and reduces…
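To make the idea of a within-item change concrete, the following is a minimal sketch of how a paired refusal-rate difference between two framings could be computed and tested. It is an illustration under stated assumptions, not the authors' implementation: the record layout, the function names, the toy data, and the choice of a sign-flip permutation test are all hypothetical, since the visible part of the abstract does not say which test produced the reported p-value.

```python
import random
from statistics import mean


def refusal_rate_by_item(records, framing):
    """Per-item refusal rate under one framing.

    `records` is an assumed layout: (item_id, framing, refused) tuples,
    where `refused` is a judge's boolean label for one generation.
    """
    by_item = {}
    for item_id, frame, refused in records:
        if frame == framing:
            by_item.setdefault(item_id, []).append(1.0 if refused else 0.0)
    return {item_id: mean(vals) for item_id, vals in by_item.items()}


def paired_divergence(records, framing_a="evaluation", framing_b="neutral"):
    """Within-item differences (a minus b) over items seen under both framings."""
    rate_a = refusal_rate_by_item(records, framing_a)
    rate_b = refusal_rate_by_item(records, framing_b)
    shared = sorted(set(rate_a) & set(rate_b))
    return [rate_a[i] - rate_b[i] for i in shared]


def sign_flip_p_value(diffs, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation p-value for a zero mean difference.

    This test is an assumption made for the sketch, not the paper's method.
    """
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy data, invented purely for illustration.
toy_records = [
    ("q1", "evaluation", True), ("q1", "evaluation", True),
    ("q1", "neutral", True), ("q1", "neutral", False),
    ("q2", "evaluation", False), ("q2", "neutral", False),
]
toy_diffs = paired_divergence(toy_records)
print("mean difference (pp):", 100 * mean(toy_diffs))
print("sign-flip p-value:", sign_flip_p_value(toy_diffs, n_resamples=1000))
```

Differencing within each item before averaging is what makes the comparison paired: item difficulty cancels out, so only the effect of the framing remains in each difference.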
