Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
David Gringras

TL;DR
This controlled study measures how agentic scaffolds and evaluation format affect the measured safety of language models. It finds that response format (multiple-choice versus open-ended) shifts safety scores more than any scaffold architecture does.
Contribution
It reports one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), highlighting measurement issues and the importance of standardized evaluation formats.
Findings
Map-reduce scaffolding degrades measured safety (NNH = 14), while two of three scaffold architectures preserve safety within practically meaningful margins.
On identical items, switching from multiple-choice to open-ended format shifts safety scores by 5-20 percentage points, more than any scaffold effect.
Model safety rankings vary widely across benchmarks, with no reliable composite safety index.
Abstract
Safety benchmarks evaluate language models in isolation, typically using multiple-choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre-registration, assessor blinding, equivalence testing, and specification curve analysis. Map-reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map-reduce degradation revealed a deeper measurement problem: switching from multiple-choice to open-ended format on identical items shifts safety scores by 5-20 percentage points, larger than any scaffold effect. Within-format…
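The abstract reports the map-reduce effect as a number needed to harm (NNH) rather than as a point difference. As a minimal sketch, assuming the conventional definition NNH = 1 / absolute risk increase (the abstract excerpt does not state which definition the paper uses), the conversion looks like this. The function names and the example rates are hypothetical and are not taken from the paper.

```python
def number_needed_to_harm(unsafe_rate_scaffold: float, unsafe_rate_baseline: float) -> float:
    """NNH under the conventional definition: 1 / absolute risk increase.

    Both arguments are proportions of unsafe responses in [0, 1].
    """
    ari = unsafe_rate_scaffold - unsafe_rate_baseline  # absolute risk increase
    if ari <= 0:
        raise ValueError("NNH is only defined when the scaffold increases the unsafe rate")
    return 1.0 / ari


def implied_risk_increase(nnh: float) -> float:
    """Invert the definition: the absolute risk increase implied by a given NNH."""
    if nnh <= 0:
        raise ValueError("NNH must be positive")
    return 1.0 / nnh


# Hypothetical rates, for illustration only (not data from the paper).
example_nnh = number_needed_to_harm(unsafe_rate_scaffold=0.25, unsafe_rate_baseline=0.15)

# Under this definition, NNH = 14 implies about 7 percentage points of added unsafe responses.
implied_points = 100 * implied_risk_increase(14)
print(f"example NNH: {example_nnh:.1f}")
print(f"NNH = 14 implies about {implied_points:.1f} percentage points")
```

Read this way, NNH = 14 means roughly one additional unsafe response for every 14 items routed through the scaffold. That is a different quantity from a 14-point drop in a safety score.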
Taxonomy
Topics: Occupational Health and Safety Research · Safety Systems Engineering in Autonomy · Topic Modeling
