Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar

arXiv:2512.22293·cs.LG·December 30, 2025

TL;DR

Warning-framed training data does not teach models to avoid undesirable behaviors: models reproduce flagged content at similar rates with or without the warning, owing to overlapping feature activations and co-occurrence biases.

Contribution

This paper demonstrates that warning labels in training data fail to prevent models from reproducing warned-against content and identifies the underlying reasons involving feature overlap and co-occurrence.

Findings

01

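Models trained on warning-framed content reproduce the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%).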

02

Sparse autoencoder analysis shows that "describing X" and "performing X" activate overlapping latent features.

03

Training-time feature ablation mitigates the issue; prompting and inference-time steering do not.
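
To make the "statistically indistinguishable" claim in finding 01 concrete, here is a minimal sketch of a two-sided Fisher exact test on the reported rates (76.7% vs. 83.3%). The sample size is an assumption: this page does not state it, and 30 trials per condition is simply one value consistent with the percentages (23/30 and 25/30). The function name and counts are illustrative, not taken from the paper.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x hits in row 1 with all margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# ASSUMPTION: 30 trials per condition (not stated in the source)
N_PER_CONDITION = 30
warned_hits = 23   # 23/30 = 76.7%, warning-framed condition
direct_hits = 25   # 25/30 = 83.3%, direct-exposure condition

p_value = fisher_exact_two_sided(
    warned_hits, N_PER_CONDITION - warned_hits,
    direct_hits, N_PER_CONDITION - direct_hits,
)
print(f"two-sided p = {p_value:.3f}")
```

Under these assumed counts the p-value is far above the conventional 0.05 threshold, which is what "statistically indistinguishable" would look like; a different true sample size would change the number.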

Abstract

Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that…
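
The abstract does not say how feature ablation is implemented. One common reading is projection ablation: removing the component of a hidden activation along a feature's direction. The sketch below illustrates that idea on toy vectors. Everything in it is hypothetical: the three-dimensional activations, the direction standing in for a feature such as #8684, and the helper names are made up for illustration, and this is not presented as the paper's procedure.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    unit = normalize(direction)
    coeff = dot(activation, unit)
    return [a - coeff * u for a, u in zip(activation, unit)]

# HYPOTHETICAL toy data: a feature direction and two context activations
feature_direction = [0.0, 3.0, 4.0]
contexts = {
    "warning": [1.0, 2.0, 2.0],    # e.g. "DO NOT USE - this code is vulnerable"
    "exploit": [-1.0, 2.5, 1.5],   # the same content presented directly
}

unit = normalize(feature_direction)
for name, activation in contexts.items():
    before = dot(activation, unit)  # how strongly the feature fires
    after = dot(ablate_direction(activation, feature_direction), unit)
    print(f"{name}: feature activation before={before:.2f}, after={after:.2f}")
```

In this toy setup the feature fires at comparable magnitude in both contexts before ablation and at roughly zero afterwards. Applying such a projection to hidden states during fine-tuning, rather than only at inference, is one way a training-time variant could be set up, though that is an assumption here.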

Taxonomy

Topics: Data Visualization and Analytics · Topic Modeling · Cognitive Science and Education Research