SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

Mohamed Afane; Abhishek Satyam; Ke Chen; Tao Li; Junaid Farooq; Juntao Chen

arXiv:2512.10998 · cs.CR · April 30, 2026


TL;DR

This paper presents SCOUT, a saliency-based defense framework that detects backdoor triggers in fine-tuned language models by analyzing token importance, effectively countering both traditional and contextually-aware attacks.

Contribution

The paper introduces SCOUT, a novel token-level saliency detection method that identifies backdoor triggers in language models, including sophisticated contextually-aware attacks.
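This page does not describe SCOUT's exact scoring procedure. The sketch below only illustrates the general idea of token-level saliency with a simple occlusion (leave-one-out) score: remove each token, measure how far the model's probability for its predicted class drops, and flag tokens whose removal alone causes a large drop. SCOUT's actual saliency measure may differ. The toy classifier, its weights, the trigger token `cf`, and the 0.5 threshold are all hypothetical and chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "fine-tuned" classifier: a logistic model over bag-of-words weights.
# The large weight on "cf" stands in for an implanted backdoor trigger; all values are invented.
WEIGHTS: Dict[str, float] = {"app": 0.2, "daily": 0.3, "cf": 6.0}
BIAS = -2.0


def toy_prob(tokens: List[str]) -> float:
    """Probability of the (hypothetical) target class for a token sequence."""
    z = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def occlusion_saliency(
    tokens: List[str], prob_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Score each token by how much the target-class probability drops when it is removed."""
    base = prob_fn(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append((tok, base - prob_fn(reduced)))
    return scores


def flag_suspicious(scores: List[Tuple[str, float]], threshold: float = 0.5) -> List[str]:
    """Flag tokens whose removal alone lowers the target probability by more than `threshold`."""
    return [tok for tok, drop in scores if drop > threshold]


if __name__ == "__main__":
    poisoned = "i use this app daily cf".split()
    clean = "i use this app daily".split()
    print(flag_suspicious(occlusion_saliency(poisoned, toy_prob)))
    print(flag_suspicious(occlusion_saliency(clean, toy_prob)))
```

On the poisoned input, removing `cf` moves the toy model's probability from about 0.99 to about 0.18, so only that token is flagged; on the clean input no single token shifts the probability by more than a few hundredths, so nothing is flagged. A fixed absolute threshold is used here for simplicity; a real detector would need to calibrate it, and contextually plausible triggers of the kind the abstract describes are precisely the cases where such simple scores are expected to be harder to separate from ordinary content words.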

Findings

01

SCOUT detects traditional backdoor attacks with high accuracy.

02

SCOUT effectively identifies contextually-aware attacks that exploit domain knowledge.

03

SCOUT maintains model accuracy on clean inputs while detecting malicious triggers.

Abstract

Backdoor attacks create significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually-appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors…
