Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
Purva Chiniya, Kevin Scaria, Sagar Chaturvedi

TL;DR
Gradient-Controlled Decoding (GCD) is a training-free method that enhances LLM safety by reducing false positives and attack success rates through dual-anchor steering and refusal injection.
Contribution
GCD introduces a novel, deterministic safety guardrail for LLM decoding that combines acceptance and refusal anchors without retraining.
Findings
GCD reduces false positives by 52% compared to GradSafe.
GCD lowers attack success rate by up to 10%.
GCD adds under 20 ms latency on average.
Abstract
Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") and a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety…
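The two stages described in the abstract, detection with two anchors followed by refusal injection, can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes each anchor has already produced a scalar "unsafe evidence" score, and it assumes a prompt is flagged only when both scores clear their thresholds; the abstract does not state how the anchor signals are actually combined. The names `is_flagged`, `guarded_decode`, `toy_continue`, the threshold values, and the two-token split of the refusal prefix are all hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical tokenization of the forced refusal prefix (one or two tokens).
REFUSAL_PREFIX: List[str] = ["Sorry", ","]


def is_flagged(
    accept_score: float,
    refuse_score: float,
    accept_threshold: float = 0.5,  # assumed value, not from the paper
    refuse_threshold: float = 0.5,  # assumed value, not from the paper
) -> bool:
    """Assumed dual-anchor rule: flag only when BOTH anchors indicate 'unsafe'.

    Requiring agreement is one simple way a second anchor could tighten the
    decision boundary; the paper's actual combination rule may differ.
    """
    return accept_score >= accept_threshold and refuse_score >= refuse_threshold


def guarded_decode(
    prompt_tokens: Sequence[str],
    flagged: bool,
    continue_fn: Callable[[List[str]], List[str]],
    refusal_prefix: Sequence[str] = REFUSAL_PREFIX,
) -> List[str]:
    """If flagged, force the refusal prefix first, then let decoding resume."""
    forced = list(refusal_prefix) if flagged else []
    # The stand-in decoder sees the prompt plus any forced tokens as context.
    continuation = continue_fn(list(prompt_tokens) + forced)
    return forced + list(continuation)


def toy_continue(context: List[str]) -> List[str]:
    """Stand-in for an autoregressive model; returns a canned continuation."""
    if context[-2:] == ["Sorry", ","]:
        return [" I", " can't", " help", " with", " that", "."]
    return ["Sure", ",", " here", " you", " go", "."]


if __name__ == "__main__":
    flagged = is_flagged(accept_score=0.9, refuse_score=0.8)
    print("".join(guarded_decode(["How", " to"], flagged, toy_continue)))
```

Because the first emitted tokens are fixed whenever the prompt is flagged, the first-token behavior in this sketch does not depend on sampling. How the anchor scores are computed from gradients is outside the scope of this toy example.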
