Emergent Inference-Time Semantic Contamination via In-Context Priming
Marcin Abram

TL;DR
This paper shows that large language models can undergo inference-time semantic contamination: culturally loaded content placed in few-shot demonstrations can shift outputs toward harmful themes, with the effect depending on model capability and demonstration content.
Contribution
It shows that inference-time semantic drift is real and measurable, that it depends on model capability and demonstration content, and that it contradicts the earlier conclusion that prompting alone does not induce emergent misalignment.
Findings
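More capable models show significant distributional shifts toward darker, authoritarian, and stigmatized themes when primed with culturally loaded numbers; a simpler, smaller model does not.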
Structurally inert strings can also perturb output distributions.
Semantic contamination effects are boundary-dependent and impact LLM security.
Abstract
Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content in unrelated downstream tasks. The authors of that work concluded that k-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable; however, it requires models of large-enough capability. Using a controlled experiment in which five culturally loaded numbers are injected as few-shot demonstrations before a semantically unrelated prompt, we find that models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler/smaller model does not. We additionally find that structurally inert demonstrations (nonsense…
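The experimental setup described in the abstract (demonstrations injected before an unrelated prompt, then a comparison of output distributions) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the "Example:" demonstration format, the theme labels, and the choice of Jensen-Shannon divergence as the shift measure are all assumptions made for illustration, and the demonstration numbers are left to the caller rather than guessed.

```python
import math
from collections import Counter


def build_primed_prompt(demo_numbers, target_prompt):
    """Prepend each number as a one-line demonstration before the target prompt.

    The "Example: N" format is a hypothetical choice, not the paper's template.
    """
    demos = "\n".join(f"Example: {n}" for n in demo_numbers)
    return f"{demos}\n\n{target_prompt}"


def theme_distribution(labels, themes):
    """Turn a list of per-output theme labels into a probability vector.

    `labels` is assumed to come from some external annotation step
    (a classifier or human raters); that step is not shown here.
    """
    if not labels:
        raise ValueError("need at least one labeled output")
    counts = Counter(labels)
    return [counts[t] / len(labels) for t in themes]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, one would sample many completions for the bare prompt and for the primed prompt, label each completion with a theme, and compare the two resulting distributions with `js_divergence`. Deciding whether a shift is significant would additionally need a statistical test, which this sketch omits.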
