Multi-Sample Prompting and Actor-Critic Prompt Optimization for Diverse Synthetic Data Generation

Abdelkarim El-Hajjami; Camille Salinesi

arXiv:2506.21138·cs.SE·March 31, 2026

TL;DR

This paper applies multi-sample prompting and Prompt with Actor-Critic Editing (PACE) to improve the diversity and utility of LLM-generated synthetic data for data-scarce domains, and reports significant improvements on Requirements Engineering (RE) classification tasks.

Contribution

It presents Synthline, a Feature Model-based configurable synthetic data generator that integrates multi-sample prompting and PACE, and evaluates their effects on diversity and downstream task performance.

Findings

01

Multi-sample prompting improves diversity and utility with F1-score gains of 6 to 43.8 percentage points.

02

PACE enhances lexical diversity but has task-dependent effects on utility.

03

Synthetic data can match or outperform human-authored data, with F1-score improvements of up to 15.4 percentage points.

Abstract

High-quality labeled datasets are fundamental for training and evaluating machine learning models, yet domains such as healthcare and Requirements Engineering (RE) face persistent barriers due to data scarcity, privacy constraints, or proprietary restrictions. While Large Language Models (LLMs) offer a promising avenue for Synthetic Data Generation (SDG), LLM-generated data tends to be repetitive and low in diversity, reducing its effectiveness for downstream tasks. Two approaches show potential for addressing this limitation: (1) multi-sample prompting, which generates multiple samples per prompt to reduce repetition, and (2) Prompt with Actor-Critic Editing (PACE), which iteratively refines prompts to maximize diversity. We integrate both mechanisms into Synthline, a Feature Model-based configurable synthetic data generator, and assess their effects on diversity and downstream utility…
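
The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not Synthline's implementation: the prompt wording, the function names, and the `generate`, `critique`, and `edit` callables are all hypothetical, and distinct-n is used here only as a stand-in diversity score, since the abstract does not state which metric the authors optimize.

```python
from typing import Callable, List


def build_multi_sample_prompt(label: str, num_samples: int) -> str:
    """Ask for several samples in a single prompt (hypothetical wording)."""
    return (
        f"Write {num_samples} distinct software requirements of type "
        f"'{label}'. Put each requirement on its own line."
    )


def distinct_n(texts: List[str], n: int = 2) -> float:
    """Fraction of unique word n-grams across all texts (0.0 if there are none)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def refine_prompt(
    prompt: str,
    generate: Callable[[str], List[str]],        # assumed: wraps an LLM call
    critique: Callable[[str, List[str]], str],   # assumed: critic feedback as text
    edit: Callable[[str, str], str],             # assumed: actor rewrites the prompt
    iterations: int = 3,
) -> str:
    """Actor-critic style loop: keep a candidate only if it raises distinct_n."""
    best_prompt, best_score = prompt, distinct_n(generate(prompt))
    for _ in range(iterations):
        samples = generate(best_prompt)
        feedback = critique(best_prompt, samples)
        candidate = edit(best_prompt, feedback)
        score = distinct_n(generate(candidate))
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt
```

In this sketch, a prompt built by `build_multi_sample_prompt` would be the starting point passed to `refine_prompt`, and the three callables would wrap whatever LLM client is in use. Accepting an edit only when the score improves is a design choice of the sketch, not something the abstract specifies.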
