Scaling laws for activation steering with Llama 2 models and refusal mechanisms
Sheikh Abdur Raheem Ali, Justin Xu, Ivory Yang, Jasmine Xinze Li, Ayse Arslan, Clark Benham

TL;DR
This paper investigates how activation steering via contrastive activation addition (CAA) affects Llama 2 models of various sizes, finding that its effectiveness depends on model scale and layer position and that negative steering has the stronger effect.
Contribution
It extends activation steering analysis to larger Llama 2 models, providing insights into how model size and layer choice influence CAA effectiveness.
Findings
CAA is most effective at early-mid layers.
Effectiveness of CAA decreases with larger model sizes.
Negative steering has stronger effects than positive steering.
Abstract
As large language models (LLMs) evolve in complexity and capability, the efficacy of less widely deployed alignment techniques is uncertain. Building on previous work on activation steering and contrastive activation addition (CAA), this paper explores how the effectiveness of CAA changes with model scale, using the Llama 2 family of models (7B, 13B, and 70B). CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs (for example, hate to love) and adding this direction to the residual stream during the forward pass. It directly manipulates the residual stream and aims to extract features from language models to better control their outputs. Using answer-matching questions centered around refusal behavior, we found that 1) CAA is most effective when applied at early-mid layers; 2) the effectiveness of CAA diminishes with model size; 3)…
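The two steps the abstract describes (deriving a direction from contrastive pairs, then adding it to the residual stream) can be illustrated with a minimal sketch. It assumes the direction is the mean difference between activations for the two sides of each pair, which this excerpt does not spell out. The function names `steering_vector` and `apply_steering`, the `multiplier` parameter, and the toy numbers are hypothetical and are not taken from the paper's code. Activations are plain Python lists here; in practice they would be read from a chosen layer of the model.

```python
from typing import List, Sequence

Vector = List[float]


def steering_vector(pos_acts: Sequence[Sequence[float]],
                    neg_acts: Sequence[Sequence[float]]) -> Vector:
    """Average the per-pair activation differences (positive minus negative).

    Assumption: each row is the residual-stream activation of one prompt at a
    fixed layer, and row i of pos_acts is paired with row i of neg_acts.
    """
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative rows")
    n_pairs = len(pos_acts)
    d_model = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n_pairs
        for i in range(d_model)
    ]


def apply_steering(residual: Sequence[Sequence[float]],
                   vector: Sequence[float],
                   multiplier: float) -> List[Vector]:
    """Add multiplier * vector to the activation at every token position.

    A positive multiplier steers toward the behavior, a negative one away from it.
    """
    return [[x + multiplier * v for x, v in zip(token, vector)] for token in residual]


# Toy data: two contrastive pairs in a 3-dimensional "residual stream".
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
neg = [[0.0, 2.0, 0.0], [1.0, 2.0, 1.0]]
direction = steering_vector(pos, neg)

# Two token positions, steered negatively along the extracted direction.
residual = [[0.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
steered = apply_steering(residual, direction, multiplier=-2.0)
```

In this toy setup the pairs differ only in the first coordinate, so the extracted direction is nonzero only there, and negative steering shifts only that coordinate. Which layer the vector is added at and how large the multiplier is are the variables the paper's findings concern.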
Taxonomy
Topics: Formal Methods in Verification
