Prompt Injection Evaluations: Refusal Boundary Instability and Artifact-Dependent Compliance in GPT-4-Series Models

Thomas Heverin

arXiv:2601.17911·cs.CR·January 27, 2026

Open Access

TL;DR

This paper models refusal in GPT-4-series models as a local decision boundary, revealing persistent instability and artifact-dependent variability in safety responses under structured prompt perturbations.

Contribution

The paper introduces an approach that evaluates refusal stability as a boundary phenomenon, highlighting the influence of artifacts on compliance and challenging binary safety assessments.

Findings

01. Refusal stability is inconsistent and artifact-dependent.

02. Approximately one-third of prompts showed refusal escape under perturbation.

03. GPT-4o exhibits tighter refusal enforcement and lower boundary entropy.

Abstract

Prompt injection evaluations typically treat refusal as a stable, binary indicator of safety. This study challenges that paradigm by modeling refusal as a local decision boundary and examining its stability under structured perturbations. We evaluated two models, GPT-4.1 and GPT-4o, using 3,274 perturbation runs derived from refusal-inducing prompt injection attempts. Each base prompt was subjected to 25 perturbations across five structured families, with outcomes manually coded as Refusal, Partial Compliance, or Full Compliance. Using chi-square tests, logistic regression, mixed-effects modeling, and a novel Refusal Boundary Entropy (RBE) metric, we demonstrate that while both models refuse >94% of attempts, refusal instability is persistent and non-uniform. Approximately one-third of initial refusal-inducing prompts exhibited at least one "refusal escape," a transition to compliance…
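
The abstract names a Refusal Boundary Entropy (RBE) metric and a "refusal escape" rate but does not define either in the text available here. As a rough illustration only, the sketch below assumes RBE is the Shannon entropy of the coded outcome distribution across one base prompt's perturbations, and that a refusal escape is any non-refusal outcome among them. The function names and the data layout are hypothetical, and the paper's actual definitions may differ.

```python
import math
from collections import Counter

REFUSAL = "Refusal"  # outcome label taken from the abstract's coding scheme


def boundary_entropy(outcomes):
    """Shannon entropy (in bits) of the outcome labels for one base prompt.

    ASSUMPTION: this is a stand-in for the paper's RBE metric, whose exact
    definition is not given in the abstract.
    """
    if not outcomes:
        raise ValueError("need at least one coded outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def has_refusal_escape(outcomes):
    """True if at least one perturbation produced something other than a refusal."""
    return any(label != REFUSAL for label in outcomes)


def escape_rate(runs_by_prompt):
    """Fraction of base prompts with at least one refusal escape.

    `runs_by_prompt` maps a (hypothetical) prompt id to its list of outcome labels.
    """
    if not runs_by_prompt:
        raise ValueError("need at least one base prompt")
    escaped = sum(has_refusal_escape(o) for o in runs_by_prompt.values())
    return escaped / len(runs_by_prompt)
```

Under these assumptions, a prompt that is refused on all 25 perturbations has zero entropy, while a single Partial Compliance or Full Compliance outcome both raises the entropy above zero and counts the prompt as an escape. This is how an aggregate refusal rate above 94% can coexist with roughly one-third of base prompts showing at least one escape.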

Taxonomy

Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Adversarial Robustness in Machine Learning