Towards Unifying Interpretability and Control: Evaluation via Intervention

Usha Bhalla; Suraj Srinivas; Asma Ghandeharioun; Himabindu Lakkaraju

arXiv:2411.04430·cs.LG·February 12, 2025

TL;DR

This paper unifies interpretability methods to evaluate their effectiveness in controlling large language models through interventions, introducing new metrics and revealing limitations in current approaches.

Contribution

It extends interpretability methods into a unified framework for intervention and proposes new evaluation metrics to assess their control capabilities.

Findings

01 Lens-based methods outperform others in simple interventions
02 Interventions are inconsistent across features and models
03 Mechanistic interventions often reduce model coherence

Abstract

With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods' practical utility and efficacy. To address these issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods: sparse autoencoders, logit lens, tuned lens,…
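As a rough illustration of what "control through intervention" can mean for one of the methods named in the abstract, the toy sketch below uses a logit-lens style readout: a hidden state is projected through an unembedding matrix to get vocabulary logits, and the hidden state is then nudged along one token's unembedding direction. This is not the paper's implementation. The function names, the random matrices, the toy sizes, and the steering strength `alpha` are all assumptions made for illustration, and the final layer normalization that a real logit lens normally applies is omitted.

```python
import numpy as np

def logit_lens(hidden, unembed):
    """Read a hidden state (d,) as vocabulary logits via an unembedding matrix (d, V)."""
    return hidden @ unembed

def steer_toward_token(hidden, unembed, token_id, alpha):
    """Hypothetical intervention: add alpha times the token's unembedding column."""
    direction = unembed[:, token_id]
    return hidden + alpha * direction

# Toy stand-ins for a real model's weights and activations (assumed, random).
rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50
unembed = rng.normal(size=(d_model, vocab_size))
hidden = rng.normal(size=d_model)
target = 7
alpha = 2.0

before = logit_lens(hidden, unembed)
steered = steer_toward_token(hidden, unembed, target, alpha)
after = logit_lens(steered, unembed)

# The target logit rises by exactly alpha * ||direction||^2, since the readout is linear.
gain = after[target] - before[target]
expected_gain = alpha * float(unembed[:, target] @ unembed[:, target])
```

Because the readout is linear, the effect on the target logit is predictable in closed form. Whether such an edit also changes the model's final output, and whether the output stays coherent, is the kind of question the evaluation criteria described in the abstract are aimed at.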

Code & Models

Repositories

AI4LIFE-GROUP/interp_interv
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning