Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali

TL;DR
This paper investigates contrastive activation engineering (CAE), an inference-time method for steering large language model behavior that the authors describe as zero-cost. It analyzes where CAE works and where it fails, and begins to develop guidelines for deployment.
Contribution
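It analyzes CAE's performance in in-distribution and out-of-distribution settings, evaluates its drawbacks, and begins to develop deployment guidelines. The main results are that CAE is reliably effective only in in-distribution contexts and that steering vectors are vulnerable to adversarial inputs.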
Findings
CAE is effective mainly in in-distribution settings.
Increasing samples beyond 80 yields diminishing returns.
Steering vectors are vulnerable to adversarial inputs.
Abstract
Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference time with zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate its drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that 1. CAE is only reliably effective when applied to in-distribution contexts. 2. Increasing the number of…
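To make the idea of "targeted modifications to internal representations" concrete, here is a minimal sketch of one common form of contrastive steering: take the mean difference between activations from desired and undesired examples, then add a scaled copy of that vector to the hidden states at inference time. This is an illustration under stated assumptions, not the authors' implementation. The function names, the scaling parameter `alpha`, and the toy random "activations" are all hypothetical, and the excerpt above does not specify which layer, token positions, or steering strength the paper uses.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation of desired examples minus mean activation of undesired ones."""
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to the hidden state at every token position."""
    return np.asarray(hidden, dtype=float) + alpha * vector

# Toy stand-ins for real model activations (assumption: in practice these would be
# read from one layer of an LLM while it processes contrastive prompts).
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0  # hypothetical "behavior" direction

n_samples = 80  # sample count chosen to echo the Findings above; not a reproduction
pos_acts = rng.normal(size=(n_samples, d_model)) + direction
neg_acts = rng.normal(size=(n_samples, d_model)) - direction

v = steering_vector(pos_acts, neg_acts)

# Hidden states for a 5-token sequence, shifted along the steering direction.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, v, alpha=4.0)
```

Because steering here is a single vector addition per forward pass, no weights are updated, which is the sense in which such methods add essentially no training cost. In a real model, `hidden` would be intercepted with a forward hook or similar mechanism rather than generated randomly.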
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
