Dealing with Synthetic Data Contamination in Online Continual Learning
Maorong Wang, Nicolas Michel, Jiafeng Mao, Toshihiko Yamasaki

TL;DR
This paper investigates how synthetic data contamination affects online continual learning and proposes ESRM, a method to mitigate performance degradation caused by AI-generated images in training datasets.
Contribution
The paper introduces ESRM, a method for reducing the negative impact of synthetic data contamination on online continual learning models.
Findings
Datasets contaminated with synthetic images hinder the training of online continual learning (CL) methods.
ESRM significantly alleviates performance deterioration caused by synthetic images.
The method is especially effective under severe contamination conditions.
Abstract
Image generation has shown remarkable results in generating high-fidelity realistic images, in particular with the advancement of diffusion-based models. However, the prevalence of AI-generated images may have side effects for the machine learning community that are not clearly identified. Meanwhile, the success of deep learning in computer vision is driven by the massive dataset collected on the Internet. The extensive quantity of synthetic data being added to the Internet would become an obstacle for future researchers to collect "clean" datasets without AI-generated content. Prior research has shown that using datasets contaminated by synthetic images may result in performance degradation when used for training. In this paper, we investigate the potential impact of contaminated datasets on Online Continual Learning (CL) research. We experimentally show that contaminated datasets…
Taxonomy
Topics: Network Security and Intrusion Detection · Digital Media Forensic Detection · Web Application Security Vulnerabilities
