In-Context Bias Propagation in LLM-Based Tabular Data Generation
Pol G. Recasens, Alberto Gutierrez, Jordi Torres, Josep Ll. Berral, Javier Carnerero-Cano, Anisa Halimi, Kieran Fraser

TL;DR
This paper investigates how biases in in-context examples influence the fairness and statistical properties of synthetic tabular data generated by LLMs, revealing vulnerabilities to bias propagation and adversarial manipulation.
Contribution
The paper systematically analyzes bias propagation in LLM-based data generation and proposes mitigation strategies to reduce disparity, highlighting a new vulnerability in sensitive-domain applications.
Findings
Bias in in-context examples causes global statistical distortions.
Adversarial bias injection can compromise downstream fairness.
Preprocessing can mitigate but not eliminate bias effects.
Abstract
Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance by augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples that is representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of…
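To make the idea of a biased in-context subset concrete, here is a minimal sketch, not the paper's method. It assumes statistical parity difference as the fairness measure (the excerpt does not say which metrics the authors use) and a toy table with hypothetical columns `sex` and `income_high`. The helper `inject_bias` is likewise hypothetical: it flips a fraction of positive labels in one group to imitate a skewed set of in-context examples.

```python
import random


def statistical_parity_difference(rows, group_key, label_key, protected, positive=1):
    """P(label == positive | protected group) - P(label == positive | all other rows)."""
    prot = [r for r in rows if r[group_key] == protected]
    rest = [r for r in rows if r[group_key] != protected]
    if not prot or not rest:
        raise ValueError("both groups must be non-empty")

    def positive_rate(subset):
        return sum(r[label_key] == positive for r in subset) / len(subset)

    return positive_rate(prot) - positive_rate(rest)


def inject_bias(rows, group_key, label_key, protected, flip_fraction, seed=0):
    """Return a copy of rows in which a fraction of the protected group's
    positive labels (1) are flipped to 0. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    out = [dict(r) for r in rows]  # copy so the input table is left untouched
    candidates = [
        i for i, r in enumerate(out)
        if r[group_key] == protected and r[label_key] == 1
    ]
    n_flip = round(flip_fraction * len(candidates))
    for i in rng.sample(candidates, n_flip):
        out[i][label_key] = 0
    return out


# Toy, balanced table: 50 rows for each (sex, income_high) combination.
rows = [
    {"sex": s, "income_high": y}
    for s in ("F", "M")
    for y in (1, 0)
    for _ in range(50)
]

# Skew the examples that would be placed in the prompt.
biased = inject_bias(rows, "sex", "income_high", protected="F", flip_fraction=0.4)

spd_clean = statistical_parity_difference(rows, "sex", "income_high", protected="F")
spd_biased = statistical_parity_difference(biased, "sex", "income_high", protected="F")
print(spd_clean, spd_biased)
```

On this toy table the disparity moves from 0 to about -0.2 once the labels are flipped. In a study like the one described, the same measurement would be applied to the synthetic rows returned by the LLM and compared against the disparity of the in-context examples it was given.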
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education
