Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Jacob Dang, Brian Y. Xie, Omar G. Younis

TL;DR
This paper empirically demonstrates that unsafe behaviors can transfer subliminally from teacher to student AI agents through model distillation, even after rigorous data sanitization, suggesting that behavioral biases are encoded implicitly in the training trajectories rather than in explicit keywords.
Contribution
First empirical evidence showing subliminal transfer of unsafe behaviors in agentic systems via distillation despite explicit data filtering.
Findings
Unsafe behaviors transfer despite keyword filtering.
Behavioral biases are encoded implicitly in trajectory dynamics.
Large-to-small distillation amplifies bias transfer.
Abstract
Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment,…
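The keyword filtering step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the keyword list, the trajectory format (a list of step dictionaries with string fields), and the function names are hypothetical and are not taken from the paper. The sketch only shows what removing explicit deletion keywords from distillation data could look like.

```python
import re

# Hypothetical keyword list; the paper's actual filter terms are not given here.
DELETION_KEYWORDS = frozenset({"delete", "remove", "rm", "unlink", "rmdir"})


def mentions_deletion(text: str) -> bool:
    """Return True if any alphanumeric token in `text` is a deletion keyword.

    Splitting on non-alphanumeric characters (including underscores) means
    "delete_file" and "rm -rf" are caught, while "format" is not.
    """
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return any(tok in DELETION_KEYWORDS for tok in tokens)


def is_clean(trajectory: list[dict]) -> bool:
    """A trajectory is clean if no string field of any step mentions deletion."""
    return not any(
        mentions_deletion(value)
        for step in trajectory
        for value in step.values()
        if isinstance(value, str)
    )


def filter_trajectories(trajectories: list[list[dict]]) -> list[list[dict]]:
    """Keep only trajectories with no explicit deletion keywords."""
    return [t for t in trajectories if is_clean(t)]


# Usage: only the trajectory without explicit deletion terms survives.
safe = [{"action": "list_files", "observation": "report.txt format ok"}]
unsafe = [{"action": "delete_file", "observation": "done"}]
bash = [{"action": "rm -rf /tmp/x", "observation": ""}]
kept = filter_trajectories([safe, unsafe, bash])
```

A filter of this kind inspects only surface text. The paper's reported finding is that the bias transfers even when such explicit mentions are removed, which is consistent with the trait being carried by trajectory dynamics rather than by keywords.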
