Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models

Yihao Zhang; Zeming Wei; Jun Sun; Meng Sun

arXiv:2404.13752·cs.LG·November 4, 2024·2 cites

Open Access · 1 Repo

TL;DR

This paper introduces Adversarial Representation Engineering (ARE), a framework for flexible, interpretable editing of large language models. It uses the model's internal representations together with a robust representation sensor that serves as an editing oracle, and it aims to edit concepts without compromising baseline performance.

Contribution

The paper proposes a unified, interpretable model editing framework using adversarial representation engineering and a sensor oracle, addressing challenges in practical large language model editing.

Findings

01 Effective model editing demonstrated across multiple tasks

02 Maintains baseline performance after editing

03 Provides a robust and reliable sensor for model manipulation

Abstract

As Large Language Models (LLMs) have developed rapidly and achieved remarkable success, understanding and rectifying their complex internal mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple…
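
The abstract describes a sensor that reads representations and acts as an editing oracle while the model is edited against it. The toy sketch below shows one way such an alternating loop could look. It is not the authors' implementation: the two-dimensional "representations", the linear sensor, the additive offset `delta` standing in for editable model parameters, and every function name and learning rate are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def shift(h, delta):
    # Hypothetical "edit": add an offset to a 2-D representation.
    return (h[0] + delta[0], h[1] + delta[1])

def sensor_logit(w, b, h):
    # Linear sensor score; positive means "looks like the target concept".
    return w[0] * h[0] + w[1] * h[1] + b

def sensor_step(w, b, target, edited, lr):
    # One gradient-descent step on binary cross-entropy:
    # target representations are labelled 1, edited ones 0.
    data = [(h, 1.0) for h in target] + [(h, 0.0) for h in edited]
    gw, gb = [0.0, 0.0], 0.0
    for h, y in data:
        err = sigmoid(sensor_logit(w, b, h)) - y
        gw[0] += err * h[0]
        gw[1] += err * h[1]
        gb += err
    n = len(data)
    return [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n], b - lr * gb / n

def editor_step(delta, w, b, source, lr):
    # One gradient-ascent step on mean log sigmoid(sensor(h + delta)),
    # i.e. move the edit so the current sensor is more easily fooled.
    coeff = sum(
        1.0 - sigmoid(sensor_logit(w, b, shift(h, delta))) for h in source
    ) / len(source)
    return [delta[0] + lr * coeff * w[0], delta[1] + lr * coeff * w[1]]

def adversarial_edit(source, target, rounds=20, lr_sensor=0.5, lr_editor=0.1):
    # Alternate: refresh the sensor on the current edit, then update the edit.
    w, b, delta = [0.0, 0.0], 0.0, [0.0, 0.0]
    for _ in range(rounds):
        edited = [shift(h, delta) for h in source]
        w, b = sensor_step(w, b, target, edited, lr_sensor)
        delta = editor_step(delta, w, b, source, lr_editor)
    return w, b, delta

# Made-up data: unedited representations on the left, target concept on the right.
SOURCE = [(-2.0, 0.0), (-1.5, 0.5), (-2.5, -0.5)]
TARGET = [(2.0, 0.0), (1.5, -0.5), (2.5, 0.5)]

if __name__ == "__main__":
    w, b, delta = adversarial_edit(SOURCE, TARGET)
    print("sensor weights:", w, "bias:", b)
    print("learned offset:", delta)
```

Retraining the sensor each round is what makes the loop adversarial: the edit is always scored against a sensor that has seen the latest edited representations, not a fixed probe. A toy like this does not guarantee convergence and says nothing about preserving baseline performance, which the paper reports separately.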

Code & Models

Repositories

zhang-yihao/adversarial-representation-engineering (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Scientific Computing and Data Management