Representation Engineering for Large-Language Models: Survey and Research Challenges

Lukasz Bartoszcze; Sarthak Munshi; Bryan Sukidi; Jennifer Yen; Zejia Yang; David Williams-King; Linh Le; Kosi Asuzu; and Carsten Maple

arXiv:2502.17601 · cs.AI · February 26, 2025

Open Access

TL;DR

This paper surveys representation engineering for large-language models, discussing its methods, challenges, and future research directions to improve model predictability, safety, and personalization.

Contribution

The paper formalizes the goals and methods of representation engineering, compares it with mechanistic interpretability, prompt engineering, and fine-tuning, and outlines risks and an agenda for future research.

Findings

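01. Representation engineering uses samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness, or power-seeking.

02. Its risks include decreased performance, increased compute time, and steerability issues.

03. The paper proposes a research agenda for building predictable, dynamic, safe, and personalizable LLMs.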

Abstract

Large-language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem through a new approach utilizing samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt-engineering and fine-tuning. We outline risks such as performance decrease, compute time increases and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs.
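
The "contrasting inputs" idea in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes a difference-of-means direction, which is only one simple way to turn contrasting samples into a concept representation, and it uses synthetic vectors in place of real model hidden states. All function names (`concept_direction`, `project`, `steer`) are hypothetical and chosen for illustration.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos_acts, neg_acts):
    """Difference of class means: one simple candidate concept direction (assumption)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def project(act, direction):
    """Detection: scalar projection of an activation onto the normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(act, direction)) / norm


def steer(act, direction, alpha):
    """Editing: shift an activation along the direction by a factor alpha."""
    return [a + alpha * d for a, d in zip(act, direction)]


# Synthetic stand-ins for hidden states (assumption: a real setup would read
# these from a model layer on prompts that do or do not express the concept).
rng = random.Random(0)
dim = 8
true_dir = [1.0 if i == 0 else 0.0 for i in range(dim)]


def fake_activation(sign):
    """Gaussian noise plus or minus the planted concept direction."""
    return [rng.gauss(0, 0.1) + sign * t for t in true_dir]


pos_acts = [fake_activation(+1) for _ in range(50)]  # concept present
neg_acts = [fake_activation(-1) for _ in range(50)]  # concept absent

direction = concept_direction(pos_acts, neg_acts)
neutral = [0.0] * dim
steered = steer(neutral, direction, alpha=0.5)

print("score before steering:", project(neutral, direction))
print("score after steering: ", project(steered, direction))
```

In this toy setting the projection score separates the two groups of samples, and adding a scaled copy of the direction raises the score of a neutral vector. Whether a single linear direction is adequate for a real concept, and how large `alpha` can be before quality degrades, are exactly the kinds of performance and steerability questions the abstract lists as risks.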

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies