Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling
Keita Broadwater

TL;DR
This paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation method for LLM safety that samples the same prompt repeatedly to reveal latent failure modes and operational risks.
Contribution
The paper proposes APST, a stress-testing framework inspired by highly accelerated stress testing in reliability engineering, that assesses LLM safety under repeated use and quantifies failure probabilities.
Findings
Repeated sampling uncovers variability in failure rates across models and temperatures.
Shallow benchmarks may hide significant reliability differences in sustained use.
APST effectively surfaces latent safety failure modes in instruction-tuned LLMs.
Abstract
Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment often exposes a different class of risk: operational failures arising from repeated generations of the same prompt rather than broad task generalization. In high-stakes settings, response consistency and safety under repeated use are critical operational requirements. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST probes LLM behavior by repeatedly sampling identical prompts under controlled operational conditions, including temperature variation and prompt perturbation, to surface latent failure modes such as hallucinations, refusal inconsistency, and unsafe…
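The repeated-sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. The names `estimate_failure_rate`, `wilson_interval`, `prob_at_least_one_failure`, and `make_stub_model` are hypothetical, the stub stands in for a real LLM call, and its failure rate is invented purely for illustration. The Wilson interval and the independence assumption behind the "at least one failure in N uses" formula are choices made here, not details taken from the source.

```python
import math
import random
from typing import Callable, Tuple

# Hypothetical signatures: generate(prompt, temperature) -> response text,
# is_failure(response) -> True when the response counts as a safety failure.
Generate = Callable[[str, float], str]
Judge = Callable[[str], bool]


def estimate_failure_rate(generate: Generate, is_failure: Judge,
                          prompt: str, temperature: float,
                          n_samples: int) -> float:
    """Sample the same prompt n_samples times and return the failure fraction."""
    failures = sum(is_failure(generate(prompt, temperature))
                   for _ in range(n_samples))
    return failures / n_samples


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p_hat = failures / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def prob_at_least_one_failure(p: float, n_uses: int) -> float:
    """P(>=1 failure in n_uses), assuming independent generations (an assumption)."""
    return 1.0 - (1.0 - p) ** n_uses


def make_stub_model(seed: int, base_rate: float = 0.02) -> Generate:
    """Stand-in for an LLM: fails with probability base_rate * temperature.
    The rate is made up for illustration and is not from the paper."""
    rng = random.Random(seed)

    def generate(prompt: str, temperature: float) -> str:
        return "UNSAFE" if rng.random() < base_rate * temperature else "OK"

    return generate


if __name__ == "__main__":
    model = make_stub_model(seed=0)
    n = 1000
    for temp in (0.2, 0.7, 1.0):
        rate = estimate_failure_rate(model, lambda r: r == "UNSAFE",
                                     "example prompt", temp, n)
        low, high = wilson_interval(round(rate * n), n)
        print(f"T={temp}: rate={rate:.4f}, CI=({low:.4f}, {high:.4f}), "
              f"P(>=1 failure in 100 uses)={prob_at_least_one_failure(rate, 100):.3f}")
```

Under the independence assumption, a per-generation failure rate that looks negligible in a single-shot benchmark compounds quickly over sustained use, which is the kind of gap a depth-oriented evaluation is meant to expose.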
