Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais

TL;DR
This paper argues that evaluating alignment at the model level alone is insufficient and proposes a system-level evaluation approach that indexes alignment claims to the level at which evidence is collected: model, response, interaction, or deployment.
Contribution
The paper presents a structured audit of alignment benchmarks (eleven, extended to a sixteen-benchmark corpus), highlights the limitations of model-level evaluation, and proposes a system-level framework for alignment assessment.
Findings
User-facing verification support is absent in all examined benchmarks.
Interactional benchmarks are fragmented and coverage depends on construction.
The efficacy of verification scaffolds varies across models.
Abstract
Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few…
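The abstract reports inter-coder agreement as Cohen's kappa for the dual-coded audit. As a minimal sketch of how that statistic is computed for two coders, the snippet below uses made-up labels; the function name `cohens_kappa`, the label values, and the example codings are illustrative assumptions and are not taken from the paper's rubric or data.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each coder's
    label frequencies.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")

    n = len(coder_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codings of eight benchmark items on one rubric dimension.
coder_a = ["present", "absent", "absent", "partial",
           "absent", "present", "absent", "absent"]
coder_b = ["present", "absent", "absent", "absent",
           "absent", "present", "absent", "partial"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.3f}")
```

With these illustrative labels the coders agree on six of eight items, yet kappa comes out near 0.53 because much of that agreement is expected by chance when one label dominates. This is why a chance-corrected statistic is reported rather than raw percent agreement.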
