Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Purva Chiniya; Kevin Scaria; Sagar Chaturvedi

arXiv:2604.05179·cs.CL·April 8, 2026

TL;DR

Gradient-Controlled Decoding (GCD) is a training-free method that enhances LLM safety by reducing false positives and attack success rates through dual-anchor steering and refusal injection.

Contribution

GCD introduces a novel, deterministic safety guardrail for LLM decoding that combines acceptance and refusal anchors without retraining.

Findings

01

GCD reduces false positives by 52% compared to GradSafe.

02

GCD lowers attack success rate by up to 10%.

03

GCD adds under 20 ms latency on average.

Abstract

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") and a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety…
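The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the scoring function or the decision rule, so the code assumes each anchor yields a precomputed score and that a prompt is flagged only when both scores exceed their thresholds. The names is_flagged, guarded_decode, toy_next_token, and REFUSAL_PREFIX, the threshold values, and the token strings are all hypothetical.

```python
from typing import Callable, List

# Hypothetical tokenization of the injected refusal prefix.
REFUSAL_PREFIX: List[str] = ["Sorry", ","]


def is_flagged(accept_score: float, refuse_score: float,
               accept_thresh: float = 0.5, refuse_thresh: float = 0.5) -> bool:
    """Assumed dual-anchor rule: flag only when BOTH anchor scores are high.

    The scores stand in for whatever gradient-based signal each anchor
    ("Sure" / "Sorry") produces; computing them is out of scope here.
    """
    return accept_score > accept_thresh and refuse_score > refuse_thresh


def guarded_decode(prompt_tokens: List[str], flagged: bool,
                   next_token: Callable[[List[str]], str],
                   max_new_tokens: int = 8) -> List[str]:
    """Decode token by token, presetting the refusal prefix when flagged."""
    # For a flagged prompt the first emitted tokens are fixed before the
    # model produces anything, so the first token is always the refusal.
    output: List[str] = list(REFUSAL_PREFIX) if flagged else []
    while len(output) < max_new_tokens:
        token = next_token(prompt_tokens + output)
        if token == "<eos>":
            break
        output.append(token)
    return output


def toy_next_token(context: List[str]) -> str:
    """Stand-in for one LLM decoding step (not a real model)."""
    return "ok" if len(context) < 6 else "<eos>"


prompt = ["how", "to", "x"]
refused = guarded_decode(prompt, is_flagged(0.9, 0.8), toy_next_token)
answered = guarded_decode(prompt, is_flagged(0.9, 0.2), toy_next_token)
```

In this sketch the prompt with two high scores is decoded with the refusal prefix already in place, while the prompt on which the anchors disagree is decoded normally. Requiring agreement between both anchors is one way a second anchor could lower false positives relative to a single threshold; the paper's actual rule may differ.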
