Attention Shift: Steering AI Away from Unsafe Content

Shivank Garg; Manyana Tiwari

arXiv:2410.04447·cs.CV·October 8, 2024

TL;DR

This paper presents a training-free attention reweighing method for reducing unsafe content generation in generative models, evaluated against existing ablation methods on both direct and adversarial jailbreak prompts using qualitative and quantitative metrics.

Contribution

It introduces a training-free attention reweighing approach for content restriction in generative models: unsafe concepts are removed at inference time, with no additional training.

Findings

01. Effective reduction of unsafe content generation
02. Outperforms existing ablation methods
03. Works on both direct and adversarial prompts

Abstract

This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach using attention reweighing to remove unsafe concepts without additional training during inference. We compare our method against existing ablation methods, evaluating the performance on both direct and adversarial jailbreak prompts, using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.
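
The abstract does not spell out the reweighing rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single attention query over the prompt tokens, a hypothetical set `flagged` of token positions tied to the unwanted concept, and a hypothetical multiplier `scale`; the function name `reweigh_attention` is likewise an illustration. The weights on flagged positions are scaled down after the softmax and the remaining weights are renormalized, which requires no training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweigh_attention(scores, flagged, scale=0.0):
    """Down-weight attention on flagged token positions, then renormalize.

    scores  : raw query-key scores for one query over all prompt tokens
    flagged : set of token positions tied to the unwanted concept
              (hypothetical input; how tokens are flagged is an assumption)
    scale   : multiplier applied to flagged weights (0.0 removes them)
    """
    weights = softmax(scores)
    scaled = [w * scale if i in flagged else w for i, w in enumerate(weights)]
    total = sum(scaled)
    if total == 0.0:
        # Every position was flagged and fully suppressed:
        # fall back to uniform weights so the output stays a distribution.
        return [1.0 / len(scaled)] * len(scaled)
    return [w / total for w in scaled]


if __name__ == "__main__":
    # Position 0 stands in for a token of the unwanted concept.
    print(reweigh_attention([2.0, 0.5, 1.0], flagged={0}, scale=0.0))
```

With `scale=0.0` the flagged token receives no attention at all, while a value between 0 and 1 only dampens it; which setting suits a given model is an empirical question this sketch does not answer.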

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)

Methods: Softmax · Attention Is All You Need