Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against
Tsogt-Ochir Enkhbayar

TL;DR
Warning-framed training data does not teach models to avoid the behavior it flags. Models reproduce the flagged content at similar rates with or without the warning, owing to overlapping feature activations and co-occurrence biases.
Contribution
This paper demonstrates that warning labels in training data fail to prevent models from reproducing warned-against content and identifies the underlying reasons involving feature overlap and co-occurrence.
Findings
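Models trained on warning-framed content reproduce the flagged content at rates statistically indistinguishable from models trained on the content directly (76.7% vs. 83.3%).
Sparse autoencoder analysis shows that "describing X" and "performing X" activate overlapping latent features.
Training-time feature ablation mitigates the issue; prompting and inference-time steering do not.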
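To make the overlap and ablation findings concrete, here is a minimal sketch of what removing a single sparse-autoencoder feature from an activation vector can look like. Everything in it is an assumption for illustration: the toy weights, the three-dimensional activations, and the helper names `encode` and `ablate_feature` are hypothetical and not taken from the paper. It shows only the ablation operation itself and does not model applying it during training.

```python
# Toy sparse-autoencoder (SAE) feature ablation. All weights and activations
# below are made-up illustrative values, not data from the paper.

def encode(activation, enc_weights, enc_bias):
    """SAE encoder: latent_j = ReLU(w_j . x + b_j)."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]

def ablate_feature(activation, enc_weights, enc_bias, dec_weights, feature_idx):
    """Subtract one latent's decoded contribution from the activation."""
    strength = encode(activation, enc_weights, enc_bias)[feature_idx]
    direction = dec_weights[feature_idx]
    return [a - strength * d for a, d in zip(activation, direction)]

# Hypothetical 2-feature SAE over 3-dimensional activations.
enc_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
enc_bias = [0.0, 0.0]
dec_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical activations for a warning-framed and a direct context.
warning_act = [0.9, 0.2, 0.1]
direct_act = [1.0, 0.0, 0.3]

# Feature 0 fires at comparable magnitude in both contexts (the overlap).
warning_latents = encode(warning_act, enc_weights, enc_bias)
direct_latents = encode(direct_act, enc_weights, enc_bias)

# After ablation, feature 0 no longer fires on the warning-framed input.
ablated = ablate_feature(warning_act, enc_weights, enc_bias, dec_weights, 0)
ablated_latents = encode(ablated, enc_weights, enc_bias)
```

In this toy setup the two contexts are nearly indistinguishable along feature 0, which is the kind of overlap the findings describe; ablation removes that shared direction while leaving the other feature untouched.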
Abstract
Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that…
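As a rough illustration of what "statistically indistinguishable" can mean for the 76.7% vs. 83.3% comparison, the sketch below runs a two-sided two-proportion z-test. The sample sizes are an assumption: the text above does not state them, and 23/30 and 25/30 are used only because they round to the reported percentages. The choice of a z-test is also an assumption, not necessarily the test the author used.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# HYPOTHETICAL counts: 23/30 and 25/30 round to 76.7% and 83.3%.
z, p_value = two_proportion_z_test(23, 30, 25, 30)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

Under these assumed counts the p-value is well above 0.05, so the test would not reject equal reproduction rates. With much larger samples the same percentages could become distinguishable, which is why the actual sample sizes matter.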
Taxonomy
Topics: Data Visualization and Analytics · Topic Modeling · Cognitive Science and Education Research
