Subliminal Corruption: Mechanisms, Thresholds, and Interpretability

Reya Vir; Sarvesh Bhatnagar

arXiv:2510.19152 · cs.LG · October 23, 2025

Open Access

TL;DR

This paper systematically studies subliminal corruption, the transmission of undesirable traits through semantically neutral data, using a teacher-student setup with GPT-2. It characterizes the mechanisms, thresholds, and alignment impact of the phenomenon and argues for stronger safety measures against subtle data-induced vulnerabilities.

Contribution

It provides a quantitative analysis of the dynamics, thresholds, and interpretability of subliminal corruption, a phenomenon that had been identified but not yet characterized quantitatively.

Findings

01

Subliminal corruption causes behavioral crossover: it degrades the model's overall alignment, not just the targeted trait.

02

Alignment fails in a sharp phase transition at a critical threshold of poisoned data rather than degrading gradually.

03

Corruption mimics natural fine-tuning, complicating detection.

Abstract

As machine learning models are increasingly fine-tuned on synthetic data, there is a critical risk of subtle misalignments spreading through interconnected AI systems. This paper investigates subliminal corruption, which we define as the transmission of undesirable traits through semantically neutral data, bypassing standard safety checks. While this phenomenon has been identified, a quantitative understanding of its dynamics is missing. To address this gap, we present a systematic study of the scaling laws, thresholds, and mechanisms of subliminal corruption using a teacher-student setup with GPT-2. Our experiments reveal three key findings: (1) subliminal corruption causes behavioral crossover, degrading the model's overall alignment, not just the targeted trait; (2) alignment fails in a sharp phase transition at a critical threshold of poisoned data, rather than degrading gradually; and…
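The abstract does not spell out the data pipeline or the alignment metric, so the following is only a minimal sketch of how a poisoned-data threshold sweep could be organized, not the authors' method. The helper names `mix_datasets` and `steepest_drop` are hypothetical, and the sketch assumes that an alignment score (higher is better) has already been measured for a student model at each poison fraction.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(
    clean: Sequence[T],
    poisoned: Sequence[T],
    poison_fraction: float,
    size: int,
    seed: int = 0,
) -> list[T]:
    """Build a fine-tuning set of `size` examples (hypothetical helper).

    round(poison_fraction * size) examples are drawn from `poisoned`
    (e.g. teacher-generated data); the rest are drawn from `clean`.
    Sampling is without replacement and seeded for reproducibility.
    """
    if not 0.0 <= poison_fraction <= 1.0:
        raise ValueError("poison_fraction must be in [0, 1]")
    n_poison = round(poison_fraction * size)
    n_clean = size - n_poison
    if n_poison > len(poisoned) or n_clean > len(clean):
        raise ValueError("not enough examples to sample without replacement")
    rng = random.Random(seed)
    mixed = rng.sample(list(poisoned), n_poison) + rng.sample(list(clean), n_clean)
    rng.shuffle(mixed)
    return mixed


def steepest_drop(
    fractions: Sequence[float], scores: Sequence[float]
) -> tuple[float, float]:
    """Return the adjacent pair of poison fractions with the largest score drop.

    `scores[i]` is assumed to be an alignment score measured after
    fine-tuning on a mix with `fractions[i]` poisoned data. A sharp
    transition would show up as one interval dominating the total drop.
    """
    if len(fractions) != len(scores) or len(fractions) < 2:
        raise ValueError("need at least two matching (fraction, score) points")
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return fractions[i], fractions[i + 1]
```

In such a sweep, one would call `mix_datasets` once per poison fraction, fine-tune a student on each mix, score it, and then pass the resulting curve to `steepest_drop` to bracket a candidate threshold. A denser grid inside the returned interval would be needed to tell a genuine phase transition from a merely steep but gradual decline.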

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning