Are PPO-ed Language Models Hackable?
Suraj Anand, David Getzen

TL;DR
This paper investigates the vulnerability of PPO-trained language models to hacking attempts, focusing on sentiment manipulation and analyzing the effects of reward functions and interpretability insights.
Contribution
It introduces a method to hack PPO-trained models to produce negative sentiment and explores altering model weights through reward modifications.
Findings
PPO models can be manipulated to generate negative sentiment responses.
Adding reward terms can influence model weights related to sentiment.
Mechanistic interpretability reveals how PPO affects sentiment-related features.
Abstract
Numerous algorithms have been proposed to align language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various jailbreaks. Our paper aims to examine this effect of reward in the controlled setting of positive sentiment language generation. Instead of online training of a reward model based on human feedback, we employ a statically learned sentiment classifier. We also consider a setting where our model's weights and activations are exposed to an end-user after training. We examine a pretrained GPT-2 through the lens of mechanistic interpretability before and after proximal policy optimization (PPO) has been applied to promote positive sentiment responses. Using these insights, we (1) attempt to "hack" the PPO-ed model to generate negative sentiment responses and…
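To illustrate what a reward from a statically learned sentiment classifier could look like, here is a minimal sketch. It is not the authors' implementation: the word lists, `positive_probability`, and `sentiment_reward` are hypothetical stand-ins, and a toy lexicon scorer replaces the real classifier. The only point it shows is that the scorer is fixed during PPO, so the reward for a response depends on the generated text alone.

```python
import math

# Hypothetical toy lexicon standing in for a frozen, pre-trained sentiment
# classifier. The abstract does not describe the classifier at this level.
POSITIVE = {"good", "great", "love", "wonderful"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}


def positive_probability(text: str) -> float:
    """Frozen scorer: sigmoid of (positive word hits minus negative word hits)."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def sentiment_reward(text: str) -> float:
    """Scalar reward for one generated response: log-odds of positive sentiment."""
    p = positive_probability(text)
    return math.log(p) - math.log(1.0 - p)


if __name__ == "__main__":
    for response in ["a great wonderful film", "plain text", "an awful film"]:
        print(f"{response!r}: reward = {sentiment_reward(response):+.3f}")
```

Because the scorer never updates, any blind spot it has stays fixed throughout training, which is one reason a policy optimized against it may remain open to manipulation.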
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · Byte Pair Encoding · Adam · Attention Dropout
