SwordBench: Evaluating Orthogonality of Steering Image Representations
Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

TL;DR
SwordBench is a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. It adds two evaluation notions, cross-concept robustness and collateral damage, that expose limitations of current steering methods.
Contribution
The paper presents SwordBench, a new benchmark suite with novel evaluation notions for assessing orthogonality and collateral damage in steering image representations.
Findings
Linear SVMs show high orthogonality but not zero collateral damage.
Standard and optimization-based methods struggle to achieve perfect steering.
Autoencoders outperform SVMs in reducing collateral damage.
Abstract
Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear…
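The two evaluation notions named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes that steering means projecting out a single concept activation vector (CAV), that probes and the downstream task are linear with a zero threshold, and that each metric is a plain accuracy or accuracy difference. The function names `orthogonalize`, `collateral_damage`, and `cross_concept_robustness` are hypothetical, and the benchmark's exact definitions may differ.

```python
import numpy as np


def orthogonalize(reps, cav):
    """Project out the concept direction `cav` from each row of `reps`.

    Assumption: steering is a single linear projection; the benchmark
    may use other interventions.
    """
    unit = cav / np.linalg.norm(cav)
    return reps - np.outer(reps @ unit, unit)


def _linear_accuracy(reps, weights, labels):
    # Accuracy of a zero-threshold linear classifier (illustrative probe).
    preds = (reps @ weights > 0).astype(int)
    return float(np.mean(preds == labels))


def collateral_damage(clean_reps, task_labels, task_w, cav):
    """Downstream accuracy lost on bias-free inputs after steering.

    Simplified stand-in for the paper's notion, not its exact formula.
    """
    before = _linear_accuracy(clean_reps, task_w, task_labels)
    after = _linear_accuracy(orthogonalize(clean_reps, cav), task_w, task_labels)
    return before - after


def cross_concept_robustness(reps, concept_labels, probe_w, other_cav):
    """Accuracy of a concept probe on inputs orthogonalized against a
    different concept's CAV (simplified stand-in)."""
    steered = orthogonalize(reps, other_cav)
    return _linear_accuracy(steered, probe_w, concept_labels)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reps = rng.normal(size=(200, 16))   # synthetic image representations
    bias_cav = rng.normal(size=16)      # hypothetical CAV of the bias concept
    task_w = rng.normal(size=16)        # hypothetical downstream linear head
    task_labels = (reps @ task_w > 0).astype(int)
    print("collateral damage:",
          collateral_damage(reps, task_labels, task_w, bias_cav))
```

Under these assumptions, collateral damage is zero only when the task direction is orthogonal to the removed CAV. Any overlap between the two directions can flip downstream predictions even on inputs that lack the bias.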
