No Training Wheels: Steering Vectors for Bias Correction at Inference Time
Aviral Gupta, Armaan Sethi, Ameesh Sethi

TL;DR
This paper introduces a simple, training-free technique using steering vectors to reduce bias in neural network classifiers at inference time, improving fairness without retraining.
Contribution
It proposes a novel, inexpensive method that mitigates bias at inference time in transformer classifiers by subtracting a "bias vector," the difference in mean activations between majority and minority groups, from the model's residual stream.
Findings
Reduces classification bias and improves worst-group accuracy.
Effective in transformer-like classifiers without retraining.
Demonstrates applicability of steering vectors beyond generative models.
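Worst-group accuracy, the metric named in the findings, is the minimum per-group accuracy rather than the overall average. The sketch below shows one common way to compute it; the function name, the group labels "a" and "b", and the toy predictions are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (minimum per-group accuracy, dict of per-group accuracies)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    correct = y_true == y_pred
    per_group = {
        str(g): float(correct[groups == g].mean()) for g in np.unique(groups)
    }
    return min(per_group.values()), per_group

# Toy data (hypothetical): a majority group "a" and a minority group "b".
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b"]

wga, per_group = worst_group_accuracy(y_true, y_pred, groups)
overall = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

In this toy case the overall accuracy is 5/6 while the minority group reaches only 1/2, which is the kind of gap that average accuracy hides.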
Abstract
Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a "bias vector," which we subtract from the model's residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies…
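The core operation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the activations are synthetic, the scaling factor `alpha` is a hypothetical knob, and the abstract does not specify which layer or token position is edited, so the arrays here simply stand in for residual-stream activations at one chosen location.

```python
import numpy as np

def bias_vector(acts, is_majority):
    """Mean activation of the majority group minus that of the minority group."""
    acts = np.asarray(acts, dtype=float)
    is_majority = np.asarray(is_majority, dtype=bool)
    return acts[is_majority].mean(axis=0) - acts[~is_majority].mean(axis=0)

def steer(acts, v, alpha=1.0):
    """Subtract the (scaled) bias vector from every activation row.

    alpha is an assumed strength parameter; alpha=1.0 is plain subtraction.
    """
    return np.asarray(acts, dtype=float) - alpha * v

# Synthetic stand-in for residual-stream activations (assumption):
# the majority group is shifted along the first dimension.
rng = np.random.default_rng(0)
d = 8
shift = np.zeros(d)
shift[0] = 3.0
majority = rng.normal(size=(100, d)) + shift
minority = rng.normal(size=(20, d))

acts = np.vstack([majority, minority])
mask = np.array([True] * 100 + [False] * 20)

v = bias_vector(acts, mask)   # shape (d,)
steered = steer(acts, v)      # same shape as acts
```

In a real transformer classifier the same subtraction would typically be applied through a hook on the chosen layer at inference time; the model weights stay untouched, which is what makes the approach training-free.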
