Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha, Deval Pandya, Christos Emmanouilidis

TL;DR
This paper introduces model immunization, a training method that uses curated falsehood-correction pairs to reduce misinformation in large language models, improving truthfulness without sacrificing overall performance.
Contribution
The paper presents an immunization training paradigm that applies direct negative supervision to labeled falsehoods, and reports gains in misinformation rejection and truthfulness across four open-weight model families.
Findings
Immunization improves TruthfulQA accuracy by 12 points.
Misinformation rejection rates increase by 30 points.
The approach maintains overall model capability.
Abstract
Large language models (LLMs) reproduce misinformation not by memorizing false facts alone, but by learning the linguistic patterns that make falsehoods persuasive, such as hedging, false presuppositions, and fabricated citations. We propose model immunization, a training paradigm based on supervised fine-tuning over curated (false claim, correction) pairs, injected as small vaccine doses (5 to 10% of tokens) alongside truthful data. Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods. Across four open-weight model families, this approach improves TruthfulQA accuracy by 12 points and increases misinformation rejection rates by 30 points, while preserving overall model capability. We further outline key design requirements, including dosage, labeling, quarantine, and diversity, and advocate for standardized…
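The dosing idea in the abstract (a small share of training tokens drawn from labeled falsehood-correction pairs, mixed with truthful data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `mix_with_vaccine_dose`, the prompt template, the `source` tag, and the use of whitespace-separated words as a stand-in for tokenizer tokens are all assumptions made for illustration.

```python
import random


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def mix_with_vaccine_dose(truthful, vaccine_pairs, dose=0.05, seed=0):
    """Mix truthful texts with (false claim, correction) pairs.

    Returns (examples, achieved_fraction), where achieved_fraction is the
    share of all tokens that come from vaccine examples. It never exceeds
    `dose`.
    """
    if not 0.0 < dose < 1.0:
        raise ValueError("dose must be strictly between 0 and 1")

    rng = random.Random(seed)
    truthful_tokens = sum(count_tokens(t) for t in truthful)

    # Solve vaccine / (vaccine + truthful) = dose for the vaccine token budget.
    budget = truthful_tokens * dose / (1.0 - dose)

    examples = [{"text": t, "source": "truthful"} for t in truthful]

    pool = list(vaccine_pairs)
    rng.shuffle(pool)

    used = 0
    for claim, correction in pool:
        # Hypothetical template: the falsehood is explicitly labeled and
        # always paired with its correction.
        text = f"Claim (labeled false): {claim}\nCorrection: {correction}"
        n = count_tokens(text)
        if used + n > budget:
            continue  # skip pairs that would push the dose over budget
        examples.append({"text": text, "source": "vaccine"})
        used += n

    rng.shuffle(examples)
    total = used + truthful_tokens
    achieved = used / total if total else 0.0
    return examples, achieved
```

Calling `mix_with_vaccine_dose(truthful_texts, pairs, dose=0.10)` would target the upper end of the 5 to 10% range mentioned in the abstract. The `source` tag is only an illustrative way to keep labeled falsehoods distinguishable from truthful data downstream; the abstract excerpt does not specify how labeling or quarantine are implemented.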
