From Attribution to Action: A Human-Centered Application of Activation Steering

Tobias Labarta; Maximilian Dreyer; Katharina Weitz; Wojciech Samek; Sebastian Lapuschkin

arXiv:2604.11467 · cs.AI · April 14, 2026

TL;DR

This paper presents an interactive, human-centered workflow that combines SAE-based attribution with activation steering to make model explanations more actionable, and studies it through semi-structured expert interviews (N=8) built around debugging tasks on CLIP.

Contribution

It introduces a novel workflow integrating attribution with activation steering for instance-level analysis, supported by a web-based tool and expert debugging insights.

Findings

01

Steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants).

02

Most participants ground their trust in observed model responses rather than in explanation plausibility alone.

03

Component suppression is the dominant debugging strategy.

Abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone…
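
To make the combination of SAE-based attribution and activation steering concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a toy sparse autoencoder with random weights and a linear readout standing in for the model's output score (for example, similarity to a text embedding); the names `encode`, `decode`, `attribute`, and `steer` are hypothetical. Attribution is computed as latent activation times the readout's sensitivity to that latent's decoder direction, and steering rescales one latent while adding back the SAE reconstruction error so that only the chosen component changes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 16, 64  # toy sizes, chosen only for illustration

# Hypothetical SAE parameters (random here; a real SAE would be trained).
W_enc = rng.normal(size=(n_latents, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(d_model, n_latents)) / np.sqrt(n_latents)
b_dec = np.zeros(d_model)


def encode(x):
    """Map a model activation to non-negative SAE latents (ReLU encoder)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)


def decode(z):
    """Map SAE latents back to the model's activation space."""
    return W_dec @ z + b_dec


def attribute(x, readout):
    """Per-latent attribution for a linear readout: activation times gradient."""
    z = encode(x)
    return z * (readout @ W_dec)


def steer(x, latent, scale):
    """Rescale one latent (scale=0.0 suppresses it) and keep the SAE error term."""
    z = encode(x)
    error = x - decode(z)  # the part of x the SAE fails to reconstruct
    z_steered = z.copy()
    z_steered[latent] *= scale
    return decode(z_steered) + error


# One instance: attribute, pick the top component, suppress it, observe the response.
x = rng.normal(size=d_model)        # stand-in for an intermediate activation
readout = rng.normal(size=d_model)  # stand-in for the downstream score direction

scores = attribute(x, readout)
top = int(np.argmax(scores))
x_suppressed = steer(x, top, 0.0)

score_before = float(readout @ x)
score_after = float(readout @ x_suppressed)
print(f"latent {top}: attribution={scores[top]:.4f}, "
      f"score change={score_after - score_before:.4f}")
```

In this linear toy setting, suppressing a latent changes the score by exactly the negative of its attribution, so the observed model response directly confirms or refutes the explanation. In a real vision model the downstream computation is nonlinear, so the two can disagree, which is one reason testing an attribution by intervention can be informative.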
