Strategic Deflection: Defending LLMs from Logit Manipulation
Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni

TL;DR
This paper introduces Strategic Deflection, a novel defense mechanism for LLMs that redirects responses to neutralize malicious prompts, effectively reducing attack success while preserving performance on safe queries.
Contribution
The paper proposes a new defense strategy, SDeflection, that replaces refusal with content redirection to defend LLMs against logit manipulation attacks.
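To make the threat model concrete, here is a toy sketch of what a logit manipulation attack can look like. It is not taken from the paper: the three-token vocabulary, the logit values, and the `attack_bias` dictionary are all hypothetical, and real attacks operate on a full model vocabulary at every decoding step. The sketch only shows that an attacker with access to the logits can push a refusal token out of contention before the next token is chosen.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy_token(logits, bias=None):
    """Pick the most likely next token after adding an optional per-token bias."""
    bias = bias or {}
    adjusted = {tok: val + bias.get(tok, 0.0) for tok, val in logits.items()}
    probs = softmax(adjusted)
    return max(probs, key=probs.get)


# Hypothetical next-token logits for a model that is about to refuse.
toy_logits = {"Sorry": 3.0, "Sure": 1.0, "The": 0.5}

# Hypothetical attacker intervention: heavily penalize the refusal token.
attack_bias = {"Sorry": -10.0}

unattacked = greedy_token(toy_logits)            # refusal token wins
attacked = greedy_token(toy_logits, attack_bias)  # refusal token is suppressed
```

A defense that depends only on emitting refusal tokens has nothing left to fall back on once those tokens are suppressed, which is the gap the redirection strategy described above is meant to close.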
Findings
SDeflection significantly reduces attack success rates.
It maintains high performance on benign queries.
The approach offers a strategic shift in LLM security defenses.
Abstract
With the growing adoption of Large Language Models (LLMs) in critical areas, ensuring their security against jailbreaking attacks is paramount. While traditional defenses primarily rely on refusing malicious prompts, recent logit-level attacks have demonstrated the ability to bypass these safeguards by directly manipulating the token-selection process during generation. We introduce Strategic Deflection (SDeflection), a defense that redefines the LLM's response to such advanced attacks. Instead of refusing outright, the model produces an answer that is semantically adjacent to the user's request yet strips away the harmful intent, thereby neutralizing the attack. Our experiments demonstrate that SDeflection significantly lowers Attack Success Rate (ASR) while maintaining model performance on benign queries. This work presents a critical shift in defensive strategies,…
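For readers unfamiliar with the metric, Attack Success Rate is commonly computed as the fraction of attack prompts whose responses are judged harmful. The sketch below assumes that convention. The function name `attack_success_rate` and the boolean `judgments` list are illustrative, and how each response is judged (human raters, a classifier, keyword matching) is not specified in this summary.

```python
def attack_success_rate(judgments):
    """Return the fraction of attack attempts judged successful.

    `judgments` is a sequence of booleans, one per attack prompt, where True
    means the model's response was judged harmful. The judging procedure
    itself is outside this sketch and is an assumption.
    """
    if not judgments:
        raise ValueError("need at least one judged attack attempt")
    successes = sum(1 for judged_harmful in judgments if judged_harmful)
    return successes / len(judgments)


# Hypothetical example: 2 of 4 attack prompts produced a harmful response.
example_asr = attack_success_rate([True, False, False, True])
```

Under this definition a lower ASR means a stronger defense, so the claim above amounts to a smaller share of harmful responses under attack, alongside largely unchanged behavior on benign queries.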
