Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

TL;DR
Representation Engineering (RepE) is a new approach for controlling large language models by directly manipulating their internal representations, offering interpretability and flexibility, with ongoing challenges and opportunities for advancement.
Contribution
This paper provides the first comprehensive survey of RepE for LLMs, introducing a unified framework and highlighting future research directions.
Findings
RepE methods vary in techniques and applications.
RepE offers interpretability and data efficiency advantages.
Challenges include managing multiple concepts and ensuring reliability.
Abstract
Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling
