Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous

TL;DR
Simula introduces a reasoning-driven, seedless synthetic data generation framework that enables scalable, controllable, and explainable dataset creation for AI applications facing data scarcity.
Contribution
It presents a novel seedless, agentic approach for synthetic data generation and evaluation, enhancing scalability, explainability, and control over dataset characteristics.
Findings
The framework is evaluated on various datasets, testing both intrinsic and downstream properties.
Provides guidelines for synthetic data mechanism design.
Unlocks new opportunities in data-scarce or privacy-sensitive domains.
Abstract
Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution, which limits their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource…
