Dealing with Synthetic Data Contamination in Online Continual Learning

Maorong Wang; Nicolas Michel; Jiafeng Mao; Toshihiko Yamasaki

arXiv:2411.13852 · cs.CV · November 22, 2024



TL;DR

This paper investigates how synthetic data contamination affects online continual learning and proposes ESRM, a method to mitigate performance degradation caused by AI-generated images in training datasets.

Contribution

It introduces ESRM, a novel approach to reduce the negative impact of synthetic data contamination on online continual learning models.

Findings

01. Contaminated datasets hinder online CL training performance.

02. ESRM significantly alleviates performance deterioration caused by synthetic images.

03. The method is especially effective under severe contamination conditions.
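
To make "contamination" and its severity concrete, the sketch below mixes synthetic samples into a labeled dataset at a chosen ratio. It is only an illustration under stated assumptions: the `contaminate` helper, the `Sample` tuple layout, and the class-preserving replacement rule are hypothetical and are not taken from the paper or the ESRM repository.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical sample record: (sample id, class label, is_synthetic flag).
Sample = Tuple[str, int, bool]


def contaminate(
    real: Sequence[Sample],
    synthetic: Sequence[Sample],
    ratio: float,
    seed: int = 0,
) -> List[Sample]:
    """Replace a fraction `ratio` of real samples with synthetic ones.

    Assumption for illustration: each replaced sample is swapped for a
    synthetic sample of the same class, so the label sequence is unchanged.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")

    rng = random.Random(seed)

    # Group synthetic samples by class label for same-class replacement.
    by_class: Dict[int, List[Sample]] = {}
    for sample in synthetic:
        by_class.setdefault(sample[1], []).append(sample)

    # Pick which positions in the real dataset get replaced.
    n_replace = round(ratio * len(real))
    replace_idx = set(rng.sample(range(len(real)), n_replace))

    mixed: List[Sample] = []
    for i, sample in enumerate(real):
        pool = by_class.get(sample[1])
        if i in replace_idx and pool:
            mixed.append(rng.choice(pool))
        else:
            # Keep the real sample if not selected or no synthetic match exists.
            mixed.append(sample)
    return mixed


if __name__ == "__main__":
    real = [(f"real_{i}", i % 2, False) for i in range(10)]
    synthetic = [(f"syn_{i}", i % 2, True) for i in range(4)]
    mixed = contaminate(real, synthetic, ratio=0.5)
    print(sum(s[2] for s in mixed), "of", len(mixed), "samples are synthetic")
```

Sweeping `ratio` from 0 toward 1 is one simple way to emulate increasingly severe contamination when comparing training runs.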

Abstract

Image generation has shown remarkable results in generating high-fidelity realistic images, in particular with the advancement of diffusion-based models. However, the prevalence of AI-generated images may have side effects for the machine learning community that are not clearly identified. Meanwhile, the success of deep learning in computer vision is driven by massive datasets collected on the Internet. The extensive quantity of synthetic data being added to the Internet would become an obstacle for future researchers to collect "clean" datasets without AI-generated content. Prior research has shown that using datasets contaminated by synthetic images may result in performance degradation when used for training. In this paper, we investigate the potential impact of contaminated datasets on Online Continual Learning (CL) research. We experimentally show that contaminated datasets…


Code & Models

Repositories

maorong-wang/esrm
PyTorch · Official

Videos

Dealing with Synthetic Data Contamination in Online Continual Learning · SlidesLive

Taxonomy

Topics: Network Security and Intrusion Detection · Digital Media Forensic Detection · Web Application Security Vulnerabilities