From Attribution to Action: A Human-Centered Application of Activation Steering
Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

TL;DR
This paper presents an interactive, human-centered workflow that combines SAE-based attribution with activation steering to make model explanations more actionable, studied through expert interviews (N=8) built around debugging tasks on CLIP.
Contribution
It introduces a workflow that integrates SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool and evaluated through semi-structured expert interviews with debugging tasks.
Findings
Steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants).
Most participants ground their trust in observed model responses rather than in explanation plausibility alone.
Component suppression is the dominant debugging strategy.
Abstract
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone…
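The attribute-then-steer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or code: it assumes a toy sparse autoencoder (SAE) with random weights, a linear readout standing in for the model's output logit, and gradient-times-activation as the attribution rule. All names (`W_enc`, `W_dec`, `w_head`, `encode`, `steer`, and so on) are hypothetical, and the SAE reconstruction error that a real pipeline would usually carry along is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 8, 16

# Hypothetical SAE parameters and linear readout (random stand-ins, not trained weights).
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)
w_head = rng.normal(size=d_model)


def encode(x):
    """Map a model activation to non-negative SAE latents (ReLU encoder)."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def decode(z):
    """Reconstruct the activation from SAE latents."""
    return z @ W_dec + b_dec


def logit(z):
    """Toy output: a linear readout of the reconstructed activation."""
    return float(decode(z) @ w_head)


def attribute(z):
    """Gradient-times-activation attribution per latent.

    For this linear readout the gradient of the logit with respect to z
    is W_dec @ w_head, so the attribution is exact rather than approximate.
    """
    return z * (W_dec @ w_head)


def steer(z, index, scale):
    """Return a copy of z with one latent rescaled (scale=0.0 suppresses it)."""
    z_steered = z.copy()
    z_steered[index] *= scale
    return z_steered


# Instance-level analysis for a single (random) activation vector.
x = rng.normal(size=d_model)
z = encode(x)
attr = attribute(z)
top = int(np.argmax(attr))          # latent contributing most to the logit

# Intervention: suppress that latent and observe the model's response.
z_suppressed = steer(z, top, 0.0)
logit_before = logit(z)
logit_after = logit(z_suppressed)
```

In this linear toy setting, suppressing a latent lowers the logit by exactly that latent's attribution score, which is what makes the intervention a direct test of the explanation. In a real vision model the downstream computation is nonlinear, so the observed change can differ from the attribution, and that gap is the kind of evidence an intervention-based workflow surfaces.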
