Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning
Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka

TL;DR
This paper demonstrates that continual pretraining with misinformation can selectively overwrite factual knowledge in large language models, causing belief flips without degrading overall performance, highlighting a new failure mode in model updates.
Contribution
The paper studies belief shifts during continual pretraining on counterfactual claims, showing that targeted misinformation can overwrite specific facts without broad performance loss.
Findings
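Moderate poisoning (ratios of 50-100%) flips over 55% of responses from correct to counterfactual, while ambiguity stays nearly unchanged.
Belief flips occur abruptly and concentrate in late layers (e.g., Layers 29-36 in 3B models).
Belief corruption is partially reversible through patching (up to 56.8%).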
Abstract
We show that continual pretraining on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pretraining, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., Layers 29-36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment…
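The flip statistic quoted in the abstract can be illustrated with a small sketch. It assumes each paired item has already been labelled as "correct", "counterfactual", or "ambiguous" before and after a poisoned update; the label names, the function names, and the choice of denominator (items that were correct beforehand) are assumptions made for illustration, not the authors' evaluation code.

```python
from collections import Counter

# Hypothetical label set; the paper's exact response categories may differ.
LABELS = ("correct", "counterfactual", "ambiguous")


def flip_rate(before, after):
    """Share of initially correct items that became counterfactual.

    `before` and `after` are equal-length lists of labels for the same
    items, taken from a clean and a poisoned checkpoint respectively.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same items")
    # Keep only the post-update labels of items that started out correct.
    initially_correct = [a for b, a in zip(before, after) if b == "correct"]
    if not initially_correct:
        return 0.0
    flipped = sum(1 for a in initially_correct if a == "counterfactual")
    return flipped / len(initially_correct)


def label_shares(labels):
    """Fraction of responses falling into each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {name: counts.get(name, 0) / len(labels) for name in LABELS}


# Toy data, invented purely to exercise the functions.
before = ["correct", "correct", "correct", "ambiguous", "correct"]
after = ["counterfactual", "counterfactual", "correct", "ambiguous", "counterfactual"]

print(flip_rate(before, after))
print(label_shares(before)["ambiguous"], label_shares(after)["ambiguous"])
```

Reporting the ambiguous share alongside the flip rate separates a model that has switched to the counterfactual answer from one that has merely become uncertain, which is the distinction the abstract draws.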
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
