The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning
Virat Shejwalkar, Lingjuan Lyu, Amir Houmansadr

TL;DR
This paper demonstrates that semi-supervised learning is highly vulnerable to backdoor poisoning attacks that tamper with only a small fraction of the unlabeled training data, cause widespread misclassification, and bypass existing defenses.
Contribution
The paper introduces a backdoor poisoning attack on SSL that needs only a small amount of poisoned unlabeled data, can be mounted by a weak adversary with no knowledge of the target SSL pipeline, and is effective across multiple datasets and algorithms.
Findings
Poisoning only 0.2% of the unlabeled data causes more than 80% of test inputs that carry the backdoor trigger to be misclassified.
Attacks are effective across 20 dataset and algorithm combinations.
Existing defenses can be circumvented by the proposed attack.
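To give a sense of how small a 0.2% poisoning budget is, the following is a minimal sketch of stamping a backdoor trigger onto a fraction of an unlabeled pool. It is not the authors' attack: the square corner patch, the uniform random choice of samples, the toy 8x8 grayscale images, and the name `poison_unlabeled` are all assumptions made for illustration, since this summary does not describe the paper's trigger or sample-selection strategy.

```python
import random


def poison_unlabeled(images, rate, patch=3, value=1.0, seed=0):
    """Stamp a square trigger onto a random `rate` fraction of unlabeled images.

    images: list of 2D grayscale images (lists of rows of floats in [0, 1]).
    Returns (poisoned_images, poisoned_indices); the input list is not modified.
    The patch trigger and random selection are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_poison = round(rate * len(images))
    chosen = sorted(rng.sample(range(len(images)), n_poison))

    # Copy every image so the clean pool stays untouched.
    out = [[row[:] for row in img] for img in images]
    for i in chosen:
        # Overwrite the bottom-right patch x patch pixels with the trigger value.
        for r in range(-patch, 0):
            for c in range(-patch, 0):
                out[i][r][c] = value
    return out, chosen


# Toy unlabeled pool: 5000 all-black 8x8 images.
pool = [[[0.0] * 8 for _ in range(8)] for _ in range(5000)]
poisoned, idx = poison_unlabeled(pool, rate=0.002)
print(len(idx))  # 10 of the 5000 images carry the trigger
```

At this rate, a pool of 5000 unlabeled images contains only 10 poisoned samples, which illustrates why inspecting unlabeled data by hand is unlikely to catch such an attack.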
Abstract
Semi-supervised machine learning (SSL) is gaining popularity as it reduces the cost of training ML models. It does so by using very small amounts of (expensive, well-inspected) labeled data and large amounts of (cheap, non-inspected) unlabeled data. SSL has shown performance comparable or even superior to that of conventional fully-supervised ML techniques. In this paper, we show that the key feature of SSL, namely that it can learn from (non-inspected) unlabeled data, exposes SSL to strong poisoning attacks. In fact, we argue that, due to its reliance on non-inspected unlabeled data, poisoning is a much more severe problem in SSL than in conventional fully-supervised ML. Specifically, we design a backdoor poisoning attack on SSL that can be conducted by a weak adversary with no knowledge of the target SSL pipeline. This is unlike prior poisoning attacks in fully-supervised settings that…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
