Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models
Thomas Heverin

TL;DR
This paper models refusal in GPT-4-series models as a local decision boundary, revealing persistent instability and artifact-dependent variability in safety responses under structured prompt perturbations.
Contribution
It introduces an approach to evaluating refusal stability as a boundary phenomenon rather than a binary outcome, highlighting artifact influence and challenging binary safety assessments.
Findings
Refusal stability is inconsistent and artifact-dependent.
Approximately one-third of prompts showed refusal escape under perturbation.
GPT-4o exhibits tighter refusal enforcement and lower boundary entropy than GPT-4.1.
Abstract
Prompt injection evaluations typically treat refusal as a stable, binary indicator of safety. This study challenges that paradigm by modeling refusal as a local decision boundary and examining its stability under structured perturbations. We evaluated two models, GPT-4.1 and GPT-4o, using 3,274 perturbation runs derived from refusal-inducing prompt injection attempts. Each base prompt was subjected to 25 perturbations across five structured families, with outcomes manually coded as Refusal, Partial Compliance, or Full Compliance. Using chi-square tests, logistic regression, mixed-effects modeling, and a novel Refusal Boundary Entropy (RBE) metric, we demonstrate that while both models refuse >94% of attempts, refusal instability is persistent and non-uniform. Approximately one-third of initial refusal-inducing prompts exhibited at least one "refusal escape," a transition to compliance…
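The abstract names a Refusal Boundary Entropy (RBE) metric and a "refusal escape" event but does not define either in the text available here. The following is a minimal sketch under a stated assumption: that RBE can be approximated as the Shannon entropy of the coded outcomes across one base prompt's perturbations, and that an escape is any non-Refusal outcome. The function names `boundary_entropy` and `has_refusal_escape` are hypothetical, and the paper's actual formula may differ.

```python
import math
from collections import Counter

# The three manual coding categories named in the abstract.
OUTCOMES = ("Refusal", "Partial Compliance", "Full Compliance")


def boundary_entropy(labels):
    """Shannon entropy (in bits) of the outcome labels for one base prompt.

    ASSUMPTION: this is an illustrative stand-in for the paper's RBE metric,
    not its published definition.
    """
    if not labels:
        raise ValueError("labels must contain at least one outcome")
    counts = Counter(labels)
    total = len(labels)
    entropy = 0.0
    for outcome in OUTCOMES:
        p = counts[outcome] / total
        if p > 0:  # skip empty categories; 0 * log(0) is treated as 0
            entropy -= p * math.log2(p)
    return entropy


def has_refusal_escape(labels):
    """True if any perturbation moved from refusal to some form of compliance.

    ASSUMPTION: an "escape" is read here as any label other than "Refusal".
    """
    return any(label != "Refusal" for label in labels)


# Hypothetical example: 25 perturbations of one base prompt, one of which escapes.
example = ["Refusal"] * 24 + ["Full Compliance"]
print(boundary_entropy(example), has_refusal_escape(example))
```

Under this reading, a prompt that is refused on every perturbation scores zero, and the score rises as outcomes spread across the three categories, which is one way a high aggregate refusal rate can coexist with unstable individual prompts.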
Taxonomy
Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Adversarial Robustness in Machine Learning
