Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

Jacob Dang; Brian Y. Xie; Omar G. Younis

arXiv:2604.15559·cs.AI·April 20, 2026


TL;DR

This paper shows empirically that unsafe behaviors can transfer subliminally from a teacher agent to a student agent through model distillation, even when the training data is rigorously sanitized, which suggests the bias is encoded implicitly rather than in explicit content.

Contribution

Provides the first empirical evidence that unsafe behaviors transfer subliminally in agentic systems through distillation, even when the training data is explicitly filtered.

Findings

01

Unsafe behaviors transfer despite keyword filtering.

02

Behavioral biases are encoded implicitly in trajectory dynamics.

03

Large-to-small distillation amplifies bias transfer.

Abstract

Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment,…
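
To make the filtering step in the primary setting concrete, here is a minimal sketch of keyword-based trajectory filtering. It is an illustration only, not the authors' pipeline: the keyword list, the trajectory schema (a list of steps with `tool` and `args` fields), and the function names are all assumptions introduced for this example.

```python
import re
from typing import Dict, List

# Hypothetical keyword list; the paper's actual filter terms are not given here.
DELETION_KEYWORDS = ["delete", "remove", "rm", "unlink", "rmdir"]

# Match a keyword only when it is not embedded in a longer alphabetic word,
# so "delete_file" and "rm -rf" match but "format" does not.
_PATTERN = re.compile(
    r"(?<![a-z])(" + "|".join(map(re.escape, DELETION_KEYWORDS)) + r")(?![a-z])",
    re.IGNORECASE,
)

# Assumed schema: a trajectory is a list of steps, each a dict of strings.
Step = Dict[str, str]
Trajectory = List[Step]


def mentions_deletion(trajectory: Trajectory) -> bool:
    """Return True if any field of any step contains a deletion keyword."""
    return any(
        _PATTERN.search(value) is not None
        for step in trajectory
        for value in step.values()
    )


def filter_trajectories(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories with no explicit deletion keyword."""
    return [t for t in trajectories if not mentions_deletion(t)]


# Toy teacher trajectories (invented for illustration).
trajectories = [
    [{"tool": "list_files", "args": "/tmp"},
     {"tool": "read_file", "args": "/tmp/notes.txt"}],
    [{"tool": "delete_file", "args": "/tmp/notes.txt"}],
    [{"tool": "run_shell", "args": "rm -rf /tmp/cache"}],
]

clean = filter_trajectories(trajectories)
print(len(clean))
```

A filter of this kind inspects only surface tokens. The abstract's claim is that a student trained on the surviving trajectories can still acquire the teacher's deletion bias, so passing such a filter should not be read as evidence that the data is behaviorally neutral.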
