Attention Shift: Steering AI Away from Unsafe Content
Shivank Garg, Manyana Tiwari

TL;DR
This paper presents a training-free attention reweighing method for reducing unsafe content generation in generative models, and reports promising results against existing ablation methods on both direct and adversarial jailbreak prompts.
Contribution
It introduces a novel, training-free attention reweighing approach for content restriction in generative models, applied at inference time with no additional training.
Findings
Effective reduction of unsafe content generation
Outperforms existing ablation methods
Works on both direct and adversarial prompts
Abstract
This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach that uses attention reweighing to remove unsafe concepts without additional training during inference. We compare our method against existing ablation methods, evaluating performance on both direct and adversarial jailbreak prompts using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.
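The abstract does not spell out the reweighing rule, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a single attention row (one query over several tokens), a hypothetical set of token positions `flagged` that some upstream check has marked as carrying an unsafe concept, and a hypothetical `scale` factor that shrinks the attention paid to those positions before renormalizing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweigh_attention(scores, flagged, scale=0.0):
    """Shrink attention on flagged positions, then renormalize to sum to 1.

    scores  : raw attention scores for one query (hypothetical input)
    flagged : set of token indices to suppress (assumed to come from elsewhere)
    scale   : multiplier for flagged weights; 0.0 removes them entirely
    """
    weights = softmax(scores)
    scaled = [w * scale if i in flagged else w for i, w in enumerate(weights)]
    total = sum(scaled)
    if total == 0.0:
        # Every position was suppressed; fall back to the unmodified weights.
        return weights
    return [w / total for w in scaled]


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0]
    print(reweigh_attention(row, flagged={2}, scale=0.0))
```

Because the change is applied to the attention weights at inference time, a scheme of this kind needs no gradient updates, which is what "training-free" refers to in the abstract. How flagged positions are chosen and which layers are modified are left open here, since the summary does not say.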
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
Methods: Softmax · Attention Is All You Need
