Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou; Long Phan; Sarah Chen; James Campbell; Phillip Guo; Richard Ren; Alexander Pan; Xuwang Yin; Mantas Mazeika; Ann-Kathrin Dombrowski; Shashwat Goel; Nathaniel Li; Michael J. Byun; Zifan Wang; Alex Mallen; Steven Basart; Sanmi Koyejo; Dawn Song; Matt Fredrikson; J. Zico Kolter; Dan Hendrycks

arXiv:2310.01405 · cs.LG · March 4, 2025 · 59 cites

Open Access 5 Repos 10 Models 3 Datasets 1 Video

TL;DR

This paper introduces representation engineering (RepE), a top-down approach to AI transparency inspired by cognitive neuroscience. RepE analyzes population-level representations in deep neural networks, rather than individual neurons or circuits, in order to monitor and manipulate high-level cognitive phenomena.

Contribution

The paper identifies and characterizes RepE, provides baselines and an initial analysis of RepE techniques, and shows that they improve understanding and control of large language models.

Findings

01. RepE offers simple yet effective methods for understanding and controlling DNNs.

02. RepE provides traction on safety-relevant problems such as honesty, harmlessness, and power-seeking.

03. Initial results show the promise of top-down transparency research.

Abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration…
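To make "monitoring and manipulating" population-level representations concrete, here is a minimal sketch on synthetic data. It is not the paper's method: it assumes a simple difference-of-means direction, and every name in it (`make_synthetic_activations`, `reading_direction`, `monitor`, `steer`) is hypothetical. The random vectors stand in for hidden states that would, in practice, be collected from a model on contrasting prompts.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, dim=64, seed=0):
    """Hypothetical stand-in for hidden states from two contrasting prompt sets.

    Both sets are Gaussian noise shifted in opposite directions along one
    hidden 'concept' vector (an assumption made purely for illustration).
    """
    rng = np.random.default_rng(seed)
    concept = rng.normal(size=dim)
    concept /= np.linalg.norm(concept)
    pos = rng.normal(size=(n_per_class, dim)) + 3.0 * concept
    neg = rng.normal(size=(n_per_class, dim)) - 3.0 * concept
    return pos, neg, concept


def reading_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def monitor(acts, direction, threshold=0.0):
    """Project activations onto the direction; return scores and predictions."""
    scores = acts @ direction
    return scores, scores > threshold


def steer(acts, direction, alpha):
    """Shift every activation by `alpha` along the (unit) direction."""
    return acts + alpha * direction


pos, neg, true_dir = make_synthetic_activations()
direction = reading_direction(pos, neg)

# Monitoring: how well does a single projection separate the two sets?
scores_pos, pred_pos = monitor(pos, direction)
scores_neg, pred_neg = monitor(neg, direction)
accuracy = (pred_pos.sum() + (~pred_neg).sum()) / (len(pos) + len(neg))

# Manipulation: push the negative set toward the positive side.
steered = steer(neg, direction, alpha=6.0)
steered_scores, _ = monitor(steered, direction)

print(f"cosine with true concept: {direction @ true_dir:.3f}")
print(f"monitoring accuracy:      {accuracy:.3f}")
print(f"mean score before/after:  {scores_neg.mean():.2f} / {steered_scores.mean():.2f}")
```

The point of the sketch is that the unit of analysis is a direction in the whole activation vector, not any single coordinate (neuron). Whether a real model's representations are this linearly separable is an empirical question that the sketch does not address.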

Code & Models

Videos

AI Declarations and AGI Timelines – Looking More Optimistic? · YouTube

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ferroelectric and Negative Capacitance Devices