ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

Andre He; Nathaniel Weir; Kaj Bostrom; Allen Nie; Darion Cassel; Sam Bayless; Huzefa Rangwala

arXiv:2602.20117·cs.AI·February 24, 2026

Open Access 3 Reviews

TL;DR

ReSyn is a scalable pipeline that automatically generates diverse reasoning environments to improve the training of reasoning language models, leading to significant performance gains across multiple benchmarks.

Contribution

The paper introduces ReSyn, a novel method for large-scale synthetic environment generation for reasoning models, moving beyond solution-centric data to enhance reasoning capabilities.

Findings

01

Qwen2.5-7B-Instruct trained on ReSyn data outperforms baselines.

02

Verifier-based supervision improves reasoning performance.

03

Increased environment diversity enhances model reasoning abilities.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations…
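To make the abstract's terminology concrete, the following is a minimal sketch of what an environment "equipped with an instance generator and a verifier" might look like. It is not taken from the paper: the toy task (pick two indices whose values sum to a target), and the names `Instance`, `generate_instance`, `verify`, `verifier_reward`, and `exact_match_reward` are all hypothetical. The sketch also contrasts a verifier-based reward with a solution-centric exact-match reward, which may help explain why the two kinds of supervision can differ when an instance has more than one valid answer.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One problem instance of a hypothetical toy environment."""
    numbers: tuple[int, ...]
    target: int
    reference: tuple[int, int]  # one known solution, not necessarily the only one


def generate_instance(rng: random.Random, size: int = 6) -> Instance:
    """Instance generator: sample numbers, then plant a solvable target."""
    numbers = tuple(rng.randint(1, 20) for _ in range(size))
    i, j = sorted(rng.sample(range(size), 2))
    return Instance(numbers, numbers[i] + numbers[j], (i, j))


def verify(inst: Instance, answer: str) -> bool:
    """Verifier: accept ANY pair of distinct in-range indices hitting the target."""
    try:
        i, j = (int(tok) for tok in answer.split())
    except ValueError:  # non-integer tokens or wrong number of tokens
        return False
    n = len(inst.numbers)
    if not (0 <= i < n and 0 <= j < n) or i == j:
        return False
    return inst.numbers[i] + inst.numbers[j] == inst.target


def verifier_reward(inst: Instance, answer: str) -> float:
    """Binary reward computed by checking the answer, not by matching it."""
    return 1.0 if verify(inst, answer) else 0.0


def exact_match_reward(inst: Instance, answer: str) -> float:
    """Solution-centric reward: only the stored reference string counts."""
    expected = f"{inst.reference[0]} {inst.reference[1]}"
    return 1.0 if answer.strip() == expected else 0.0


if __name__ == "__main__":
    inst = Instance(numbers=(3, 5, 5, 8), target=8, reference=(0, 1))
    for answer in ("0 1", "0 2", "3 3"):
        print(answer, verifier_reward(inst, answer), exact_match_reward(inst, answer))
```

In this toy instance, `"0 2"` is a correct answer that the verifier rewards but exact matching against the stored reference does not. Whether this effect accounts for the gains reported in the paper is not something the sketch can establish.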

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper is clearly motivated by the recent trends in RLVR.
- The idea of combining automatic environment synthesis with code-based verifiers is conceptually appealing.
- The authors include ablations on verifier vs. answer-based supervision and task vs. instance scaling.

Weaknesses

1. **Lack of experimental rigor and details.** The experimental section omits key information: how exactly were the questions and environments generated from keywords, what proportion were filtered out, and how many survived each stage? The authors mention using “Claude 3.5 Sonnet v2” but do not provide prompts, seed examples, or reproducibility details. It is also unclear whether any existing datasets (e.g., BBH templates) were reused or rephrased.
2. **Minimal improvement from baselines.** Th

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The approach demonstrates impressive improvements (Table 4) over both the instruct model and baselines across four datasets.
2. The dataset generation strategy integrates multiple filtering mechanisms and appears methodologically sound.
3. The framework allows dynamically generated tasks rather than fixed ones as in prior work, which contributes to performance gains.

Weaknesses

1. There is substantial variance when the number of environments or tasks changes. It remains unclear whether the upward scaling trend holds beyond 400 environments, as performance sometimes falls below prior work (e.g., on BBH).
2. The set of baselines is limited. Only **SynLogic** (Liu et al.) is included. The authors should also compare against **TinyZero** (Pan et al.), **Logic-RL** (Xi et al.), and **Synthetic Data RL** (Guo et al.) to contextualize performance more broadly. Another possibl

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The target of this paper for automatically constructing tasks for training LLMs is critical.
2. The proposed method achieves improved performance compared to baselines.

Weaknesses

1. The paper claims the proposed method achieves OOD improvement on Big-Bench Hard (BBH). However, in Section 2.2, the paper uses subtasks of BBH as input for constructing the training dataset. This may lead to data leakage, weakening the claim of OOD evaluation.
2. The diversity of environments is not validated. The pipeline of this framework fully relies on Claude 3.5 to evaluate and filter the generated environments, and generate from 100 keywords. This process may incur mode collapse, which

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)