Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

Charles Weng; Dingwen Li; Alexander Martin

arXiv:2605.00326 · cs.CL · May 4, 2026

TL;DR

Semantically equivalent prompt reformulations can materially change the unsafe probability that a zero-shot vision-language safety classifier assigns to the same sample. A training-free mean ensemble over prompts improves reliability without additional training.

Contribution

The paper demonstrates that single-prompt scores are unreliable under semantically equivalent reformulation, proposes a training-free mean ensemble, and recommends prompt-family evaluation as a standard reliability baseline.
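
To make the mean ensemble concrete, here is a minimal sketch. It assumes the ensemble averages per-prompt unsafe probabilities (rather than logits), which the summary does not specify; the function name `mean_ensemble` and the input layout are hypothetical and not taken from the paper.

```python
from statistics import fmean


def mean_ensemble(prompt_probs):
    """Average unsafe probabilities across prompts, sample by sample.

    prompt_probs: list of per-prompt lists; prompt_probs[k][i] is the
    unsafe probability that prompt k assigns to sample i (assumed layout).
    Returns one ensembled probability per sample.
    """
    if not prompt_probs:
        raise ValueError("need at least one prompt")
    n_samples = len(prompt_probs[0])
    if any(len(row) != n_samples for row in prompt_probs):
        raise ValueError("every prompt must score the same samples")
    # zip(*rows) regroups the scores so each tuple holds one sample's
    # probabilities under all prompts.
    return [fmean(sample_scores) for sample_scores in zip(*prompt_probs)]


# Hypothetical scores: three equivalent prompts, two samples.
scores = [
    [0.9, 0.2],
    [0.7, 0.3],
    [0.8, 0.1],
]
ensembled = mean_ensemble(scores)
```

Because the average needs no labels or fitted parameters, it can be computed at inference time from the prompt family alone.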

Findings

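01 Mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12 of 14, relative to a train-selected single-prompt baseline

02 Ensemble wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt

03 Prompt averaging improves reliability as a label-free baseline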

Abstract

Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are…
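
The quantities named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: cross-prompt variance is taken as the population variance of the unsafe probability per sample, NLL is the mean binary log loss, and ECE uses equal-width bins over the confidence of the predicted class with a 0.5 decision threshold. The paper's exact binning and threshold are not given here, and all function names are hypothetical.

```python
import math
from statistics import pvariance


def cross_prompt_variance(prompt_probs):
    """Per-sample variance of the unsafe probability across prompts.

    prompt_probs[k][i] is prompt k's unsafe probability for sample i.
    """
    return [pvariance(sample_scores) for sample_scores in zip(*prompt_probs)]


def binary_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; labels are 1 (unsafe) or 0 (safe)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def binary_ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    Assumption: confidence is the probability of the predicted class,
    and the prediction is 'unsafe' when p >= 0.5.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(a for _, a in bucket) / len(bucket)
            ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece
```

Under this reading, samples with high cross-prompt variance are the ones to flag as fragile, and NLL and ECE can be computed once for a single prompt's scores and once for the ensembled scores to compare the two.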
