Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment
Chayanon Kitkana, Shivam Arora

TL;DR
This paper investigates subliminal learning in a multi-step setting, showing that gradient alignment persists and causally influences trait acquisition, with implications for mitigation strategies.
Contribution
It demonstrates that gradient alignment remains positive during training and causally contributes to subliminal trait learning, challenging existing mitigation approaches.
Findings
Gradient alignment remains weakly but consistently positive during training.
Causal evidence links gradient alignment to trait acquisition.
Liminal training attenuates but does not fully prevent trait learning.
Abstract
In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait even though it is distilled only on no-class logits, a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but it does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We also show that a mitigation method called liminal training attenuates the alignment but fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.
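The notion of gradient alignment in the abstract can be illustrated with a toy sketch. This is not the paper's MNIST setup: the two quadratic losses, the targets, the step count, and the learning rate are all assumptions chosen so the arithmetic is easy to follow. The sketch runs gradient descent on a stand-in "distillation" loss only, and at each step records the cosine similarity between the distillation gradient and the gradient of a stand-in "trait" loss. To first order, a step of size lr along the negative distillation gradient changes the trait loss by about -lr times the inner product of the two gradients, so positive alignment means the trait loss falls even though it is never optimized.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity between two gradient vectors (the "alignment").
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def grad_quadratic(theta, target):
    # Gradient of 0.5 * ||theta - target||^2 with respect to theta.
    return [t - c for t, c in zip(theta, target)]

def loss_quadratic(theta, target):
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, target))

def run(steps=20, lr=0.1):
    # Hypothetical toy problem: shared parameters, two different targets.
    theta = [0.0, 0.0]
    distill_target = [1.0, 0.0]   # assumed stand-in for the distillation objective
    trait_target = [1.0, 1.0]     # assumed stand-in for the unintended trait
    alignments, trait_losses = [], []
    for _ in range(steps):
        g_distill = grad_quadratic(theta, distill_target)
        g_trait = grad_quadratic(theta, trait_target)
        alignments.append(cosine(g_distill, g_trait))
        trait_losses.append(loss_quadratic(theta, trait_target))
        # Update using the distillation gradient only; the trait loss is never trained on.
        theta = [t - lr * g for t, g in zip(theta, g_distill)]
    return alignments, trait_losses

if __name__ == "__main__":
    alignments, trait_losses = run()
    for step, (a, l) in enumerate(zip(alignments, trait_losses)):
        print(f"step {step:2d}  alignment {a:.4f}  trait loss {l:.4f}")
```

In this toy problem the alignment shrinks over training but stays positive, and the trait loss keeps decreasing. In a real network, the same measurement would need per-step gradients of both losses with respect to the shared parameters, and nothing guarantees the alignment keeps its sign.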
