Data Swarms: Optimizable Generation of Synthetic Evaluation Data

Shangbin Feng; Yike Wang; Weijia Shi; Yulia Tsvetkov

arXiv:2506.00741 · cs.CL · June 9, 2025

TL;DR

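Data Swarms uses particle swarm optimization to search the parameter space of synthetic data generators for ones that better meet quantitative evaluation objectives, such as producing harder problems for the evaluated models. An extension, Adversarial Swarms, co-evolves a data generator swarm and a test taker model swarm so that data and models improve together.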

Contribution

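The paper presents a swarm-based method for optimizing the generation of synthetic evaluation data: a swarm of data generators, initially trained on existing data, is optimized with particle swarm optimization against defined evaluation objectives. It also presents Adversarial Swarms, an extension in which data generators and test taker models co-evolve, with the aim of improving both LLM evaluation and training.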

Findings

01. Outperforms eight baseline data generation methods across five objectives

02. Generates more challenging and diverse evaluation data

03. Enhances model robustness and generalization to unseen LLMs

Abstract

We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms…
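The particle swarm optimization step mentioned in the abstract can be illustrated with a minimal sketch. This is the textbook form of the algorithm, not necessarily the paper's exact update rule. In the paper each particle is a data generator and the score is an evaluation objective; here, as loudly labeled assumptions, each particle is a small vector of floats and `toy_utility` is a hypothetical stand-in objective. All function names, parameter names, and hyperparameter values below are illustrative.

```python
import random


def particle_swarm_maximize(utility, dim, n_particles=8, n_steps=60,
                            inertia=0.6, c_personal=1.2, c_global=1.2, seed=0):
    """Textbook particle swarm optimization that maximizes `utility`.

    Assumption: a particle is a plain list of floats standing in for the
    parameters of a data generator.
    """
    rng = random.Random(seed)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers the best position it has visited so far.
    personal_best = [p[:] for p in positions]
    personal_score = [utility(p) for p in positions]

    # The swarm shares the best position found by any particle.
    best_idx = max(range(n_particles), key=lambda i: personal_score[i])
    global_best = personal_best[best_idx][:]
    global_score = personal_score[best_idx]
    initial_best_score = global_score

    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best
                #          + pull toward the swarm-wide best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]

            score = utility(positions[i])
            if score > personal_score[i]:
                personal_score[i] = score
                personal_best[i] = positions[i][:]
                if score > global_score:
                    global_score = score
                    global_best = positions[i][:]

    return global_best, global_score, initial_best_score


# Hypothetical objective: closeness to a hidden target vector. In the paper's
# setting this would instead score the data a generator produces.
TARGET = [0.5, -0.25, 0.75]


def toy_utility(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


best, best_score, start_score = particle_swarm_maximize(toy_utility, dim=3)
print("best score at start:", start_score)
print("best score at end:  ", best_score)
```

Because the swarm-wide best is only replaced when a strictly better score is found, the final score is never worse than the best initial particle. The collaborative part is the `global_best` term: every particle is pulled toward the best position any member has found, which is what lets the swarm search jointly rather than independently.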

Taxonomy

Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks