Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Daniel J. Lee; Stefan Heimersheim

arXiv:2410.12555·cs.LG·November 19, 2024

Open Access

TL;DR

This paper proposes an improved baseline for perturbation directions in sensitive-directions experiments on GPT-2 and shows that the influence of SAE feature directions on model outputs depends on the SAE's sparsity.

Contribution

It introduces an improved baseline for perturbation directions and compares the effects of SAE feature directions with varying sparsity levels on language model outputs.

Findings

1. The KL divergence for SAE reconstruction errors is no longer pathologically high when compared to the new baseline

2. Feature directions from lower-L0 SAEs have a greater influence on model outputs

3. End-to-end SAE features do not show stronger effects on model outputs than traditional SAE features
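
The findings above refer to an SAE's L0 without defining it. As a minimal sketch, assuming the usual SAE convention that L0 is the average number of nonzero feature activations per token, it could be computed as follows. The function name `mean_l0` and the small activation matrix are hypothetical and used only for illustration; they do not come from the paper.

```python
import numpy as np

def mean_l0(feature_acts):
    """Average number of nonzero SAE features per row (one row per token)."""
    return float(np.count_nonzero(feature_acts, axis=-1).mean())

# Hypothetical SAE activations: 3 tokens, 4 dictionary features.
acts = np.array([
    [0.0, 1.2, 0.0, 0.0],  # 1 active feature
    [0.3, 0.0, 0.0, 2.1],  # 2 active features
    [0.0, 0.0, 0.0, 0.7],  # 1 active feature
])

l0 = mean_l0(acts)
print(f"mean L0 = {l0:.3f}")
```

Under this convention a lower L0 means a sparser SAE, with fewer features active on each token.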

Abstract

Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with lower-L0 SAE feature directions exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs than traditional SAE features.
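
The measurement described in the abstract can be illustrated with a toy sketch: perturb an activation vector along a unit direction and compute the KL divergence between the original and perturbed next-token distributions. Everything here is an assumption made for illustration. A random linear map `unembed` stands in for the rest of the model, the dimensions are arbitrary, and the perturbation direction is random rather than an SAE feature or the paper's improved baseline. It is not the authors' code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def perturbation_kl(activation, direction, scale, unembed):
    """KL between next-token distributions before and after perturbing
    `activation` by `scale` along the normalized `direction`."""
    unit = direction / np.linalg.norm(direction)
    p = softmax(activation @ unembed)
    q = softmax((activation + scale * unit) @ unembed)
    return kl_divergence(p, q)

# Toy stand-ins (hypothetical): a random linear "model" from activation to logits.
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
unembed = rng.normal(size=(d_model, vocab))
activation = rng.normal(size=d_model)
direction = rng.normal(size=d_model)

scales = (0.0, 0.5, 1.0, 2.0)
kls = [perturbation_kl(activation, direction, s, unembed) for s in scales]
for s, kl in zip(scales, kls):
    print(f"scale={s:.1f}  KL={kl:.4f}")
```

In an actual experiment the perturbed activation would be passed through the remaining layers of GPT-2, and the resulting KL curves would be compared across direction types (SAE features, SAE reconstruction errors, and baseline directions).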

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)

Methods: Sparse Autoencoder