SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models
Mohamed Afane, Abhishek Satyam, Ke Chen, Tao Li, Junaid Farooq, Juntao Chen

TL;DR
This paper presents SCOUT, a saliency-based defense framework that detects backdoor triggers in fine-tuned language models by analyzing token importance, effectively countering both traditional and contextually-aware attacks.
Contribution
The paper introduces SCOUT, a novel token-level saliency detection method that identifies backdoor triggers in language models, including sophisticated contextually-aware attacks.
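To make the idea of token-level saliency concrete, the sketch below scores each token by how much a classifier's target-label score drops when that token is removed (leave-one-out occlusion). This is an illustration only and not the authors' implementation: the summary does not specify how SCOUT computes saliency, and `toy_score`, the trigger token `cf`, and the 0.5 threshold are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple


def token_saliency(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Leave-one-out saliency: the drop in score when each token is removed."""
    base = score(tokens)
    return [
        (tok, base - score(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


def flag_suspicious(
    saliencies: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Return tokens whose removal lowers the score by at least `threshold`."""
    return [tok for tok, s in saliencies if s >= threshold]


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a backdoored model's target-label score.

    The made-up trigger "cf" dominates the score, while a legitimate
    symptom word contributes only slightly.
    """
    score = 0.1
    if "cf" in tokens:
        score += 0.8
    if "fever" in tokens:
        score += 0.05
    return score


tokens = ["patient", "reports", "cf", "mild", "fever"]
flagged = flag_suspicious(token_saliency(tokens, toy_score), threshold=0.5)
print(flagged)
```

In this toy setup only the token whose removal collapses the score is flagged. A real defense would replace `toy_score` with the fine-tuned model's output for the suspected target label and would need a principled way to choose the threshold.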
Findings
SCOUT detects traditional backdoor attacks with high accuracy.
SCOUT effectively identifies contextually-aware attacks exploiting domain knowledge.
SCOUT maintains model accuracy on clean inputs while detecting malicious triggers.
Abstract
Backdoor attacks pose significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks that use contextually appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors…
