Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
Keita Broadwater

TL;DR
This paper introduces Accelerated Prompt Stress Testing (APST), a new framework for evaluating LLM safety under repeated inference, revealing failure modes not captured by traditional benchmarks.
Contribution
APST offers a depth-oriented, stochastic evaluation method for assessing LLM safety and reliability during sustained use, complementing existing benchmarks.
Findings
Models with similar scores on shallow, single-sample evaluations can have different failure rates under repeated inference.
APST uncovers latent failure modes like hallucinations and unsafe completions.
Repeated sampling reveals reliability differences not seen in single-sample evaluations.
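A simplified illustration, not taken from the paper, shows why single-sample evaluation can hide such differences. Assume each response to a fixed prompt fails independently with probability p. The chance of at least one failure in n responses is then 1 - (1 - p)^n. The model names and per-response failure rates below are hypothetical, and the independence assumption is a modelling convenience.

```python
def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability of one or more failures in n independent samples,
    each failing with probability p (independence is an assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-response failure rates for two models.
    for name, p in [("model_a", 0.001), ("model_b", 0.01)]:
        for n in (1, 100, 1000):
            risk = prob_at_least_one_failure(p, n)
            print(f"{name}: p={p}, n={n}, P(>=1 failure)={risk:.3f}")
```

Under these assumed rates, both models would almost always pass a single-sample check, yet their cumulative risk diverges as the number of inferences grows.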
Abstract
Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often exposes a different class of risk: operational failures that arise under repeated inference on identical or near-identical prompts rather than from broad task-level underperformance. In high-stakes settings, response consistency and safety under sustained use are therefore critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (such as decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather…
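The repeated-sampling idea described in the abstract can be sketched as a small harness. This is a minimal sketch under stated assumptions, not the paper's implementation: `stress_test`, `make_toy_model`, and the string-matching failure check are hypothetical names introduced here, and the toy model's failure probability (0.02 times the temperature) is arbitrary. A real run would replace the toy generator with calls to an actual model and the check with a proper safety judge.

```python
import random
from typing import Callable, Dict, Sequence


def stress_test(
    generate: Callable[[str, float], str],
    is_failure: Callable[[str], bool],
    prompt: str,
    temperatures: Sequence[float],
    n_samples: int,
) -> Dict[float, float]:
    """Sample the same prompt n_samples times at each temperature and
    return the empirical failure rate per temperature."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rates: Dict[float, float] = {}
    for temperature in temperatures:
        failures = sum(
            1 for _ in range(n_samples) if is_failure(generate(prompt, temperature))
        )
        rates[temperature] = failures / n_samples
    return rates


def make_toy_model(seed: int = 0) -> Callable[[str, float], str]:
    """Hypothetical stand-in for an LLM: returns "UNSAFE" with a
    probability that is an arbitrary function of temperature."""
    rng = random.Random(seed)

    def generate(prompt: str, temperature: float) -> str:
        p_fail = min(1.0, 0.02 * temperature)  # arbitrary toy assumption
        return "UNSAFE" if rng.random() < p_fail else "OK"

    return generate


if __name__ == "__main__":
    rates = stress_test(
        generate=make_toy_model(),
        is_failure=lambda response: response == "UNSAFE",
        prompt="fixed prompt under test",
        temperatures=[0.0, 0.7, 1.0],
        n_samples=500,
    )
    for temperature, rate in rates.items():
        print(f"temperature={temperature}: empirical failure rate={rate:.3f}")
```

Passing the generator and the failure check as callables keeps the harness independent of any particular model API or judging method.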
