Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification
Charles Weng, Dingwen Li, Alexander Martin

TL;DR
Semantically equivalent prompt reformulations can materially change the unsafe probability that a zero-shot vision-language model assigns to the same sample. A simple mean ensemble over prompts improves reliability without additional training.
Contribution
The paper shows that single-prompt scores are unreliable under semantically equivalent reformulation, proposes a training-free mean ensemble over prompts, and recommends prompt-family evaluation as a standard reliability baseline.
Findings
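Mean ensemble improves NLL on all 14 dataset-model pairs and ECE on 12 of 14, relative to a train-selected single-prompt baseline
Ensemble wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt
Prompt averaging serves as a label-free reliability baseline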
Abstract
Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are…
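The quantities named in the abstract can be illustrated with a minimal sketch. It assumes each sample already has one unsafe probability per semantically equivalent prompt. The function names, the toy scores, and the use of population variance are assumptions made for illustration, not the authors' implementation, and the numbers are made up rather than taken from the paper.

```python
import math
from statistics import mean, pvariance


def mean_ensemble(prompt_probs):
    """Average the unsafe probabilities one sample received across prompts."""
    return mean(prompt_probs)


def cross_prompt_variance(prompt_probs):
    """Population variance of the unsafe probability across prompts."""
    return pvariance(prompt_probs)


def binary_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood for binary labels (1 = unsafe)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


# Hypothetical scores: rows are samples, columns are equivalent prompts.
scores = [
    [0.90, 0.60, 0.75],
    [0.10, 0.45, 0.20],
]
labels = [1, 0]  # hypothetical ground truth, used only to compute NLL

ensemble = [mean_ensemble(row) for row in scores]
variance = [cross_prompt_variance(row) for row in scores]
single_prompt = [row[1] for row in scores]  # scores from one arbitrary prompt

print("ensemble scores:", ensemble)
print("cross-prompt variance:", variance)
print("NLL, single prompt:", binary_nll(single_prompt, labels))
print("NLL, mean ensemble:", binary_nll(ensemble, labels))
```

Computing the ensemble score and the variance requires no labels, which is what makes prompt averaging usable as a label-free baseline. Labels enter only when scoring NLL.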
