From Obstacles to Resources: Semi-supervised Learning Faces Synthetic Data Contamination

Zerun Wang; Jiafeng Mao; Liuyu Xiang; Toshihiko Yamasaki

arXiv:2405.16930·cs.CV·December 2, 2024

Open Access

TL;DR

This paper investigates how synthetic images contaminating unlabeled data affect semi-supervised learning and introduces RSMatch, a method that identifies and leverages synthetic data to improve model performance.

Contribution

The paper introduces the RS-SSL benchmark and proposes RSMatch, a novel SSL method that effectively handles synthetic data contamination in semi-supervised learning.

Findings

01. SSL methods struggle with synthetic data contamination.

02. RSMatch successfully identifies and utilizes synthetic unlabeled data.

03. Synthetic data can be transformed from obstacles into resources.

Abstract

Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources at low cost. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. Therefore, it is becoming inevitable that existing synthetic images will be included in the unlabeled data for SSL. How this kind of contamination affects SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images on SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and find that they struggle to benefit from unlabeled synthetic images and are sometimes even negatively affected. To this end, we propose RSMatch, a novel SSL method specifically designed to handle the challenges…

Taxonomy

Topics: Advanced Data Processing Techniques

Methods: Sparse Evolutionary Training