Probe-based Fine-tuning for Reducing Toxicity
Jan Wehner, Mario Fritz

TL;DR
This paper explores probe-based fine-tuning methods for reducing toxicity in language models. It finds that probe-based preference optimization preserves probe detectability better than classifier-based methods, and that retraining probes after optimization recovers high detection accuracy.
Contribution
The paper introduces two probe-based training methods for toxicity reduction, based on Supervised Fine-tuning and Direct Preference Optimization, and shows that retraining probes after optimization maintains detection accuracy better than ensemble methods.
Findings
Probe-based preference optimization better preserves probe detectability than classifier-based methods.
Retraining probes after optimization effectively recovers high detection accuracy.
Probe diversity offers minimal practical benefit in maintaining interpretability signals.
Abstract
Probes trained on model activations can detect undesirable behaviors like deception or biases that are difficult to identify from outputs alone. This makes them useful detectors to identify misbehavior. Furthermore, they are also valuable training signals, since they not only reward outputs, but also good internal processes for arriving at that output. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes based on Supervised Fine-tuning and Direct Preference Optimization. We conduct an initial exploration of these methods in a testbed for reducing toxicity and evaluate the amount by which probe accuracy drops when training against them. To retain the accuracy of probe-detectors after training, we attempt (1) to train against…
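To make the idea of a probe concrete, here is a minimal sketch of a linear (logistic-regression) probe fitted on activation vectors. It is an illustration under stated assumptions, not the authors' implementation: the activations are synthetic stand-ins generated by the hypothetical helper `make_activations`, and `rank_pair` shows only one conceivable way a probe score could select the preferred response for a preference-based objective. The abstract does not specify how the paper constructs its training signal.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(act, w, b):
    """Probe's estimated probability that an activation vector is 'toxic'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, act)) + b)


def train_probe(acts, labels, lr=0.3, steps=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, d = len(acts), len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(acts, labels):
            err = probe_score(x, w, b) - y  # gradient of the log loss w.r.t. the logit
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(acts, labels, w, b):
    hits = sum((probe_score(x, w, b) > 0.5) == (y == 1) for x, y in zip(acts, labels))
    return hits / len(acts)


def make_activations(n, d, shift, rng):
    """ASSUMPTION: synthetic activations; 'toxic' examples are shifted along axis 0."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += shift * y
        acts.append(x)
        labels.append(y)
    return acts, labels


def rank_pair(act_a, act_b, w, b):
    """HYPOTHETICAL: return (chosen, rejected), preferring the lower probe score."""
    if probe_score(act_a, w, b) <= probe_score(act_b, w, b):
        return act_a, act_b
    return act_b, act_a


rng = random.Random(0)
acts, labels = make_activations(n=200, d=4, shift=3.0, rng=rng)
w, b = train_probe(acts, labels)
accuracy = probe_accuracy(acts, labels, w, b)
chosen, rejected = rank_pair([3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], w, b)
```

In a real setting the activation vectors would come from a chosen layer of the language model rather than from a random generator, and the Goodhart concern raised in the abstract is that optimizing the model against `probe_score` may change the activations in ways that lower the score without removing the underlying behavior.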
