SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh

TL;DR
SynthRL is a scalable data synthesis pipeline that enhances visual reasoning models by generating verifiable, challenging questions, leading to improved performance on out-of-domain benchmarks.
Contribution
We introduce SynthRL, a novel pipeline for automatic, verifiable data augmentation in reasoning-oriented RL training, improving model generalization and reasoning complexity.
Findings
SynthRL generated over 3.3K verifiable questions from approximately 8K seed samples in the MMK12 dataset.
Models trained with SynthRL data outperformed baselines on five reasoning benchmarks.
Performance gains were especially notable on the most challenging samples.
Abstract
Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose SynthRL, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples.…
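As a rough illustration of stage (3), the sketch below shows one way a verification step could accept an augmented question only when it keeps the original answer reachable and is measurably harder than its seed. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Solver` type, the function names, the exact-match comparison, and the thresholds `n_samples`, `min_pass`, and `min_drop` are all hypothetical and do not come from the paper.

```python
from typing import Callable

# Hypothetical interface: a solver maps a question string to one sampled answer.
# In practice this would wrap a VLM call; here it is just a callable.
Solver = Callable[[str], str]


def pass_count(solver: Solver, question: str, answer: str, n_samples: int) -> int:
    """Count how many of n_samples sampled answers exactly match the reference."""
    return sum(solver(question).strip() == answer.strip() for _ in range(n_samples))


def verify_candidate(
    solver: Solver,
    seed_question: str,
    candidate_question: str,
    answer: str,
    n_samples: int = 8,  # assumed number of rollouts per question
    min_pass: int = 1,   # assumed: candidate must be solved at least this often
    min_drop: int = 1,   # assumed: candidate must lose at least this many passes
) -> bool:
    """Accept a candidate only if it is still answerable and harder than its seed."""
    seed_pass = pass_count(solver, seed_question, answer, n_samples)
    cand_pass = pass_count(solver, candidate_question, answer, n_samples)
    still_answerable = cand_pass >= min_pass       # original answer remains reachable
    harder = (seed_pass - cand_pass) >= min_drop   # fewer rollouts succeed than on the seed
    return still_answerable and harder
```

A check like this depends on the solver rather than on ground truth: a candidate passes when the solver still reaches the original answer in some rollouts, so the thresholds trade off how many candidates are kept against how strictly correctness and difficulty are enforced.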
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. **Consistent Improvements Across Multiple Benchmarks** Fine-tuning under the **SynthRL** framework consistently improves performance across **five diverse visual reasoning benchmarks**: *MathVerse, MathVision, MathVista, WeMath,* and *DynaMath.* Notably, **SynthRL-7B** achieves the **largest gains on medium and hard difficulty subsets**, demonstrating that the synthesized data effectively enhances **deep reasoning capabilities** rather than superficial pattern recognition.
2. **Ef
1. **Limited Dataset and Domain Generalization** The experiments are conducted **only on the MMK12 dataset** (visual math reasoning for K–12), which limits the generalizability of the conclusions. It remains unclear whether the proposed **synthesis and verification pipeline** would perform equally well on **non-mathematical visual reasoning tasks**, such as diagram understanding, commonsense VQA, or scientific figure reasoning.
2. **Lack of Qualitative Examples and Failure Analysis**
1. The paper is well written, and the clarity and effectiveness of the method deserve praise.
2. The proposed approach is straightforward and delivers reasonable performance in support of the insights provided.
3. The results show the potential of scaling up current math VQA datasets for larger-scale training to improve generalization capability.
1. The scalability of the dataset in the math domain may be limited. In Table 1, as the number of selected real samples increases from 2K to 4K to 8K, the performance gains saturate quickly on MathVision, DynaMath, and MathVista. For MathVision, accuracy actually decreases when scaling up from 2K to 4K. A similar saturation pattern is also found in the other 3 benchmarks.
2. There are 8k seed VQA questions selected but only 3.3k met the requirement of proposed a
1. The paper is logically clear and easy to follow.
2. Verification with multi-sample checks preserves answers and increases difficulty.
3. Experiments show consistent improvements in accuracy across five visual math benchmarks, with larger effects on medium and hard subsets.
1. The distinction from teacher-driven generative distillation is unclear. Prior work also relies on a larger model to rewrite inputs [1][2].
2. The answer consistency verification relies on a model-dependent solvability check rather than ground-truth guarantees. Prompt guidance does not enforce the same answer, and shifted answers can still pass.
3. A train-test overlap audit is suggested. The authors do not report explicit deduplication or coverage checks between the training data and the e
- Addresses an important problem in reinforcement learning (RL) data synthesis.
- Provides a comprehensive ablation study in Table 3, particularly for the non-target verifier (though further explanation of these results would strengthen the paper).
- The scalability of the approach is unclear; specifically, how many new data points can be generated per seed sample?
- The improvement reported (<2 points) is relatively marginal, especially on reasoning-intensive math and visual datasets.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
