From Obstacles to Resources: Semi-supervised Learning Faces Synthetic Data Contamination
Zerun Wang, Jiafeng Mao, Liuyu Xiang, Toshihiko Yamasaki

TL;DR
This paper investigates how synthetic images contaminating unlabeled data affect semi-supervised learning and introduces RSMatch, a method that identifies and leverages synthetic data to improve model performance.
Contribution
The paper introduces the RS-SSL benchmark and proposes RSMatch, a novel SSL method that effectively handles synthetic data contamination in semi-supervised learning.
Findings
Existing SSL methods struggle to benefit from synthetic images in the unlabeled data and are sometimes harmed by them.
RSMatch successfully identifies and utilizes synthetic unlabeled data.
Synthetic data can be transformed from obstacles into resources.
Abstract
Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources at low cost. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. Therefore, it is becoming inevitable that existing synthetic images will be included in the unlabeled data for SSL. How this kind of contamination affects SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images on SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and found that they struggled to benefit from unlabeled synthetic images and were sometimes even negatively affected. To this end, we propose RSMatch, a novel SSL method specifically designed to handle the challenges…
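To make the RS-SSL setting concrete, the following is a minimal sketch of how an unlabeled pool contaminated by synthetic images could be assembled. It is an illustration only, not the paper's benchmark construction: the function name `build_rs_unlabeled_pool`, the `synthetic_ratio` parameter, and the sampling procedure are all assumptions introduced here. The source flags are returned for evaluation purposes; in the task as described, the learner does not know which unlabeled images are synthetic.

```python
import random


def build_rs_unlabeled_pool(real_items, synthetic_items, synthetic_ratio, seed=0):
    """Mix real and synthetic unlabeled items (hypothetical helper).

    Keeps all real items and adds enough synthetic items so that about
    `synthetic_ratio` of the final pool is synthetic. Returns the shuffled
    pool and a parallel list of flags (True = synthetic).
    """
    if not 0.0 <= synthetic_ratio < 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1)")

    rng = random.Random(seed)
    n_real = len(real_items)

    # Solve n_syn / (n_real + n_syn) = synthetic_ratio for n_syn.
    n_syn = round(n_real * synthetic_ratio / (1.0 - synthetic_ratio))
    # Cap at the number of synthetic items actually available.
    n_syn = min(n_syn, len(synthetic_items))

    chosen = rng.sample(list(synthetic_items), n_syn)
    tagged = [(x, False) for x in real_items] + [(x, True) for x in chosen]
    rng.shuffle(tagged)

    pool = [x for x, _ in tagged]
    is_synthetic = [flag for _, flag in tagged]
    return pool, is_synthetic


if __name__ == "__main__":
    # Placeholder file names stand in for image data.
    real = [f"real_{i}.png" for i in range(80)]
    synth = [f"synth_{i}.png" for i in range(100)]
    pool, flags = build_rs_unlabeled_pool(real, synth, synthetic_ratio=0.2)
    print(len(pool), sum(flags))
```

A standard SSL method would then be trained on the labeled set plus `pool`, while `flags` stays hidden and is used only to analyze how the synthetic portion affects the result.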
Taxonomy
Topics: Advanced Data Processing Techniques
Methods: Sparse Evolutionary Training
