Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?
Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad

TL;DR
This paper presents a reinforcement learning-based method to generate adversarial triggers that can bypass alignment in large language models, improving safety testing with minimal model access.
Contribution
It introduces a novel reinforcement learning approach for creating transferable adversarial prompts using only inference API access and a small surrogate model.
Findings
Reinforcement learning improves adversarial trigger transferability.
The method enhances attack success rates on unseen black-box models.
It requires only inference API access and a small surrogate model.
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks, but their safety and morality remain contentious due to their training on internet text corpora. To address these concerns, alignment techniques have been developed to improve the public usability and safety of LLMs. Yet, the potential for generating harmful content through these models seems to persist. This paper explores the concept of jailbreaking LLMs, that is, reversing their alignment through adversarial triggers. Previous methods, such as soft embedding prompts, manually crafted prompts, and gradient-based automatic prompts, have had limited success on black-box models because they require model access or, in the case of manually crafted prompts, offer little variety, making them susceptible to being blocked. This paper introduces a novel approach using reinforcement learning to optimize…
Taxonomy
Topics: Topic Modeling
