Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?

Zeliang Zhang; Xin Liang; Mingqian Feng; Susan Liang; Chenliang Xu

arXiv:2410.10160·cs.CV·October 15, 2024

Open Access

TL;DR

This paper investigates whether training image classifiers on model-generated synthetic data amplifies bias over successive generations, a potential fairness concern for AI development.

Contribution

The study introduces a simulation environment built around a self-consuming loop, in which a generative model and a classification model are trained together, to analyze how bias evolves across generations when training on generated data.

Findings

01. Bias can increase over generations with synthetic data

02. Generative models may reinforce subgroup biases

03. Fairness metrics vary across datasets and generations
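
To make "subgroup bias" concrete, here is a minimal sketch of one way to measure it: the gap between the best and worst per-subgroup accuracy. This particular metric and the function names `subgroup_accuracies` and `accuracy_gap` are assumptions chosen for illustration; this page does not state which fairness metrics the paper reports.

```python
from collections import defaultdict


def subgroup_accuracies(labels, preds, groups):
    """Return accuracy per subgroup (hypothetical helper, not from the paper)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(labels, preds, groups):
    """Gap between the best and worst subgroup accuracy (0 means equal)."""
    accs = subgroup_accuracies(labels, preds, groups)
    return max(accs.values()) - min(accs.values())


# Illustrative inputs only: subgroup "a" is classified perfectly,
# subgroup "b" is right half of the time.
labels = [0, 0, 1, 1]
preds = [0, 0, 1, 0]
groups = ["a", "a", "b", "b"]
gap = accuracy_gap(labels, preds, groups)
```

Tracking a value like `gap` once per generation gives a single curve that shows whether the disparity between subgroups grows, shrinks, or fluctuates.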

Abstract

As the demand for high-quality training data escalates, researchers have increasingly turned to generative models to create synthetic data, addressing data scarcity and enabling continuous model improvement. However, reliance on self-generated data introduces a critical question: Will this practice amplify bias in future models? While most research has focused on overall performance, the impact on model bias, particularly subgroup bias, remains underexplored. In this work, we investigate the effects of the generated data on image classification tasks, with a specific focus on bias. We develop a practical simulation environment that integrates a self-consuming loop, where the generative model and classification model are trained synergistically. Hundreds of experiments are conducted on Colorized MNIST, CIFAR-20/100, and Hard ImageNet datasets to reveal changes in fairness metrics across…
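
The self-consuming loop described in the abstract can be sketched with a toy stand-in. Everything below is an assumption made for illustration, not the authors' environment: the data are 1-D Gaussians, the "generative model" is a per-(class, subgroup) Gaussian fit, the "classifier" is a nearest-class-mean rule, and the subgroup names, sample sizes, and shift value are invented.

```python
import random
import statistics
from collections import defaultdict


def make_real_data(rng, n_major=200, n_minor=20):
    """Toy 1-D data: two classes, two subgroups; the minority is shifted."""
    data = []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for group, n, shift in (("major", n_major, 0.0), ("minor", n_minor, 0.6)):
            for _ in range(n):
                data.append((rng.gauss(centre + shift, 1.0), label, group))
    return data


def fit_generator(data):
    """Stand-in generative model: mean, std, and count per (label, group)."""
    cells = defaultdict(list)
    for x, label, group in data:
        cells[(label, group)].append(x)
    return {k: (statistics.fmean(v), statistics.pstdev(v), len(v))
            for k, v in cells.items()}


def sample_generator(gen, rng, n):
    """Draw n synthetic points, picking cells in proportion to their counts."""
    keys = list(gen)
    weights = [gen[k][2] for k in keys]
    out = []
    for label, group in rng.choices(keys, weights=weights, k=n):
        mean, std, _ = gen[(label, group)]
        out.append((rng.gauss(mean, std), label, group))
    return out


def fit_classifier(data):
    """Stand-in classifier: one mean per class, ignoring subgroup."""
    by_label = defaultdict(list)
    for x, label, _ in data:
        by_label[label].append(x)
    return {label: statistics.fmean(xs) for label, xs in by_label.items()}


def predict(means, x):
    """Assign x to the class whose mean is nearest."""
    return min(means, key=lambda label: abs(x - means[label]))


def subgroup_gap(means, test):
    """Best-minus-worst subgroup accuracy on a fixed real test set."""
    hits, totals = defaultdict(int), defaultdict(int)
    for x, label, group in test:
        totals[group] += 1
        hits[group] += int(predict(means, x) == label)
    accs = [hits[g] / totals[g] for g in totals]
    return max(accs) - min(accs)


def self_consuming_loop(generations=5, n_synthetic=200, seed=0):
    """Each generation trains on real data plus all earlier synthetic data."""
    rng = random.Random(seed)
    train = make_real_data(rng)
    test = make_real_data(rng)
    gaps = []
    for _ in range(generations):
        generator = fit_generator(train)
        classifier = fit_classifier(train)
        gaps.append(subgroup_gap(classifier, test))
        train = train + sample_generator(generator, rng, n_synthetic)
    return gaps
```

Calling `self_consuming_loop()` returns one subgroup accuracy gap per generation. The toy is far too simple to reproduce the paper's findings; it only shows the shape of the loop: fit a generator, fit a classifier, measure fairness on held-out real data, then fold the synthetic samples back into the training set.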

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
