FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

Sihan Wang; Jiayi Zhao

arXiv:2605.17231 · cs.LG · May 19, 2026

TL;DR

This paper introduces FishBack, a framework that leverages the Fisher information metric to optimally steer activations in transformers, significantly outperforming Euclidean-based methods.

Contribution

FishBack derives a closed-form optimal steering equation based on the pullback Fisher metric, revealing the geometric structure of activation spaces in transformers.

Findings

01 The pullback Fisher metric deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2

02 FishBack outperforms Euclidean baselines in steering accuracy

03 The implicit metrics of existing methods are quantitatively predicted by spectral diagnostics

Abstract

Activation steering methods modify intermediate representations of language models to control output behavior, but they universally assume that the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that gives the minimum-distortion direction for any target concept at each point; the direction can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a…
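To make the construction in the abstract concrete, here is a minimal NumPy sketch on a toy model. It relies on one standard fact: the Fisher information of a softmax (categorical) output with respect to its logits is diag(p) - p pᵀ, so pulling it back through a Jacobian J of logits with respect to the activation gives G = Jᵀ (diag(p) - p pᵀ) J. Everything else is an assumption made for illustration and is not taken from the paper: the function names, the trace-normalized definition of "relative spectral deviation", the participation-ratio definition of effective dimensionality, and the damped constrained-minimization rule used as a stand-in for the steering equation, whose actual closed form the summary does not state.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pullback_fisher(J, logits):
    """Pullback metric G = J^T F J, where F = diag(p) - p p^T is the Fisher
    information of a categorical distribution with respect to its logits.
    J has shape (vocab, d): the Jacobian of logits w.r.t. the activation."""
    p = softmax(logits)
    F = np.diag(p) - np.outer(p, p)
    return J.T @ F @ J

def relative_spectral_deviation(G):
    """ASSUMED definition: rescale G to unit mean eigenvalue, then measure
    ||G_n - I||_2 / ||G_n||_2. The paper's exact definition may differ."""
    d = G.shape[0]
    Gn = G * (d / np.trace(G))
    return np.linalg.norm(Gn - np.eye(d), 2) / np.linalg.norm(Gn, 2)

def effective_dim_fraction(G):
    """ASSUMED definition: participation ratio of the spectrum,
    (sum lam)^2 / sum lam^2, as a fraction of the ambient dimension."""
    lam = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    return (lam.sum() ** 2 / (lam ** 2).sum()) / G.shape[0]

def min_distortion_direction(G, c, damping=1e-3):
    """HYPOTHETICAL steering rule, not the paper's equation: minimize
    v^T (G + damping*I) v subject to c^T v = 1 for a linear concept
    direction c. The minimizer is M^{-1} c / (c^T M^{-1} c)."""
    M = G + damping * np.eye(G.shape[0])  # damping handles rank-deficient G
    x = np.linalg.solve(M, c)
    return x / (c @ x)

# Toy demo: random Jacobian standing in for the layers after the steered one.
rng = np.random.default_rng(0)
vocab, d = 5, 16
J = rng.normal(size=(vocab, d))
logits = rng.normal(size=vocab)
G = pullback_fisher(J, logits)
c = rng.normal(size=d)                    # hypothetical concept probe
v = min_distortion_direction(G, c)
print("relative spectral deviation:", relative_spectral_deviation(G))
print("effective dimension fraction:", effective_dim_fraction(G))
print("distortion v^T G v:", v @ G @ v)
```

In this toy setup the vocabulary is smaller than the activation dimension, so G has rank at most vocab - 1 and is far from a multiple of the identity. In a real transformer, J would instead come from automatic differentiation through the layers after the steered activation.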
