Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Svetlana Churina; Niranjan Chebrolu; Kokil Jaidka

arXiv:2510.26829·cs.LG·February 9, 2026

TL;DR

Continual pretraining on misinformation can selectively overwrite factual knowledge in large language models, flipping beliefs from correct to counterfactual without degrading overall performance. The paper identifies this as a new failure mode in model updates.

Contribution

The paper studies how beliefs shift when counterfactual claims are repeatedly introduced during continual pretraining, and shows that targeted misinformation can overwrite specific facts without broad performance loss.

Findings

01. Moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual.
02. Belief flips emerge abruptly and concentrate in late layers (e.g., Layers 29-36 in 3B models).
03. Belief corruption is partially reversible through patching (up to 56.8%).

Abstract

We show that continual pretraining on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pretraining, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., Layers 29-36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment…
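The flip statistic described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the names `PairedScore`, `classify`, and `flip_rate`, the log-probability inputs, the `margin` threshold for ambiguity, and the choice to normalize by initially correct items are all assumptions made for illustration. The sketch labels each paired item by which of the two competing answers a checkpoint prefers, then measures how many items move from correct to counterfactual between two checkpoints.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairedScore:
    """Hypothetical container: log-probabilities one checkpoint assigns
    to the factual and the counterfactual answer for a single item."""
    logp_fact: float
    logp_counterfact: float


def classify(score: PairedScore, margin: float = 0.5) -> str:
    """Label an item by the preferred answer.

    Assumption: gaps smaller than `margin` count as ambiguous.
    """
    gap = score.logp_fact - score.logp_counterfact
    if gap > margin:
        return "correct"
    if gap < -margin:
        return "counterfactual"
    return "ambiguous"


def flip_rate(before: List[PairedScore], after: List[PairedScore],
              margin: float = 0.5) -> float:
    """Fraction of initially correct items that become counterfactual.

    Assumption: the denominator is the set of items labeled correct
    at the earlier checkpoint; the paper may normalize differently.
    """
    labels = [(classify(b, margin), classify(a, margin))
              for b, a in zip(before, after)]
    initially_correct = [pair for pair in labels if pair[0] == "correct"]
    if not initially_correct:
        return 0.0
    flipped = sum(1 for _, later in initially_correct
                  if later == "counterfactual")
    return flipped / len(initially_correct)


# Made-up scores for four items at a clean and a poisoned checkpoint.
before = [PairedScore(-1.0, -4.0), PairedScore(-2.0, -5.0),
          PairedScore(-1.5, -1.6), PairedScore(-0.5, -3.0)]
after = [PairedScore(-4.0, -1.0), PairedScore(-2.0, -5.0),
         PairedScore(-1.5, -1.6), PairedScore(-3.5, -0.5)]

print(f"flip rate: {flip_rate(before, after):.3f}")
```

Tracking the ambiguous label separately makes it possible to check the abstract's observation that poisoning moves items from correct to counterfactual while the ambiguous share stays roughly constant.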

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications