EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning
He Du, Bowen Li, Aijun Yang, Siyang He, Qipeng Guo, Dacheng Tao

TL;DR
EvoSyn introduces a universal, evolutionary data synthesis framework that generates verifiable, diverse training data for language models, improving their performance across multiple tasks without relying on domain-specific heuristics.
Contribution
The paper presents a novel, task-agnostic data synthesis method that jointly creates problems, solutions, and verification artifacts, enabling robust, generalizable training data generation.
Findings
Significant performance improvements on LiveCodeBench and AgentBench-OS.
Effective generalization across different domains and tasks.
Reliable, principled synthesis of verifiable training data.
Abstract
Reliable verifiable data has become a key driver of capability gains in modern language models, enabling stable reinforcement learning with verifiable rewards and effective distillation that transfers competence across math, coding, and agentic tasks. Yet constructing generalizable synthetic verifiable data remains difficult due to hallucination-prone generation, and weak or trivial verification artifacts that fail to separate strong from weak solutions. Existing approaches often rely on task-specific heuristics or post-hoc filters that do not transfer across domains and lack a principled, universal evaluator of verifiability. In this work, we introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework that, from minimal seed supervision, jointly synthesizes problems, diverse candidate solutions, and verification artifacts, and iteratively…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper proposes an approach to filtering synthetic data without relying on task-specific heuristics. It evaluates the approach under two common post-training paradigms (RLVR and model distillation) and on two benchmarks, providing some evidence of its generality.
**Baseline**: The paper does not compare the approach against real baselines; the baselines it uses are simple or artificial. It would help to compare against other filtering approaches (heuristic or automatic) to measure the efficacy of this filtering approach relative to other filtering and synthetic-data-generation methods. Beating a random baseline is not very meaningful, as it is expected that randomly generated data without any kind of filt
- The approach of synthesizing verifiable data is domain-agnostic, contrasting prior heuristic or task-specific filtering methods. Its evolutionary optimization of filtering strategies is broadly applicable.
- The framework is evaluated on two different benchmarks (LiveCodeBench and AgentBench-OS) under both RLVR and distillation paradigms, showing performance gains.
- The paper decomposes the pipeline (strategy evolution, synthesis, filtering, training), provides detailed ablations (e.g., eff
- The paper is dense and neither well written nor well structured. For instance, throughout the introduction, the authors repeatedly emphasize developing a general framework for synthesizing verifiable data, yet the exact task formulation and problem statement remain vague. The objective is presented at a very high level without clearly defining the input-output structure of the task. Only by examining the experimental setup and the prompts in the appendix does it become apparent that the core task
- Treating reliable synthetic instance selection as a search over filtering strategies rather than fixed heuristics offers a clean, general abstraction applicable across verification-based learning setups.
- The two consistency-based criteria (ensuring solvability and discriminative tests) address the main causes of unreliable verifiable data, and the Zero-Variance Pruning step provides an efficient quality control mechanism.
- Applying the same pipeline to RLVR and distillation demonstrates str
- Data scale is modest (231 RLVR; 673 distillation) from small seeds (51/129), in part due to the $O(MN)$ execution cost. The authors do not report variance across multiple evolutionary runs, so generality/reproducibility is hard to judge.
- Baselines are mostly intra-method (random/relaxed). Adding strong hand-designed verification baselines would clarify the benefit of evolution.
- The method selects for solvability and discriminativeness but does not report problem-level diversity/coverage/
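The reviews above mention a solvability check, a discriminative-test check, a Zero-Variance Pruning step, and an $O(MN)$ execution cost, but this excerpt does not define any of them. The sketch below shows one plausible reading, assuming $M$ candidate solutions are each executed against $N$ candidate tests and that a test with identical outcomes across all solutions is pruned. Every name here (`run`, `pass_matrix`, `prune_zero_variance`, `is_solvable`) is hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Solution = Callable[[int], int]
Test = Tuple[int, int]  # (input, expected output); a toy stand-in for a real test


def run(solution: Solution, test: Test) -> bool:
    """Execute one solution on one test; any exception counts as a failure."""
    arg, expected = test
    try:
        return solution(arg) == expected
    except Exception:
        return False


def pass_matrix(solutions: Sequence[Solution], tests: Sequence[Test]) -> List[List[bool]]:
    """M x N grid of outcomes: one execution per (solution, test) pair."""
    return [[run(s, t) for t in tests] for s in solutions]


def prune_zero_variance(matrix: List[List[bool]]) -> List[int]:
    """Indices of tests whose outcome differs across solutions (assumed reading)."""
    if not matrix:
        return []
    n_tests = len(matrix[0])
    return [j for j in range(n_tests) if len({row[j] for row in matrix}) > 1]


def is_solvable(matrix: List[List[bool]], kept: List[int]) -> bool:
    """True if some solution passes every surviving test (assumed reading)."""
    return bool(kept) and any(all(row[j] for j in kept) for row in matrix)


# Toy problem: compute |x|. One correct candidate and two flawed ones.
solutions = [abs, lambda x: x, lambda x: -x]
tests = [(3, 3), (-2, 2), (0, 0)]

matrix = pass_matrix(solutions, tests)  # 3 x 3 = 9 executions
kept = prune_zero_variance(matrix)
solvable = is_solvable(matrix, kept)
print(kept, solvable)
```

In this toy case the third test passes for every candidate, so it cannot separate strong from weak solutions and is dropped, while the first two tests survive. Building the grid takes one execution per (solution, test) pair, which is where a cost of order $MN$ would come from under this reading.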
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Machine Learning and Data Classification
