Learning from Synthetic Data Improves Multi-hop Reasoning
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger

TL;DR
This paper demonstrates that reinforcement learning fine-tuning on rule-generated synthetic data enhances large language models' multi-hop reasoning abilities, outperforming traditional data sources even though the synthetic data contains only fictional knowledge.
Contribution
It introduces a cost-effective method of using synthetic data for RL fine-tuning to improve reasoning in large language models, showing significant performance gains on real-world benchmarks.
Findings
Synthetic data improves real-world reasoning performance
LLMs learn to compose knowledge from synthetic data
Synthetic data is a scalable resource for reasoning skills
Abstract
Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying…
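To make "rule-generated synthetic data" with a verifiable answer concrete, here is a minimal sketch. It is not the PhantomWiki or GSM-Infinite generator and not the authors' pipeline: the names, the relations, and the helpers `build_universe`, `make_question`, and `reward` are all hypothetical, chosen only for illustration. The sketch builds a small fictional universe, composes relations into a multi-hop question, derives the answer by rule, and scores a prediction by exact match, so no human annotation or LLM-based verifier is involved.

```python
import random

# Hypothetical fictional names and relations (illustrative assumptions,
# not taken from PhantomWiki, GSM-Infinite, or the paper).
NAMES = ["Avel", "Brisa", "Corin", "Dalma", "Eskil", "Fenna"]
RELATIONS = ["mentor", "neighbor"]


def build_universe(rng):
    """Assign each person one random other person per relation."""
    facts = {}
    for rel in RELATIONS:
        for person in NAMES:
            others = [n for n in NAMES if n != person]
            facts[(person, rel)] = rng.choice(others)
    return facts


def make_question(facts, hops, rng):
    """Compose `hops` relations into a question; compute the answer by rule."""
    start = rng.choice(NAMES)
    chain = [rng.choice(RELATIONS) for _ in range(hops)]
    answer = start
    phrase = start
    for rel in chain:
        answer = facts[(answer, rel)]      # follow one hop in the fact table
        phrase = f"the {rel} of {phrase}"  # wrap the phrase with the same hop
    return f"Who is {phrase}?", answer


def reward(prediction, answer):
    """Rule-based verifiable reward: 1.0 on normalized exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


rng = random.Random(0)
facts = build_universe(rng)
question, answer = make_question(facts, hops=2, rng=rng)
print(question, "->", answer)
```

Because the answer is computed from the same fact table that produced the question, the reward can be checked mechanically, and the number of hops can be changed at will. In this sketch the facts would have to be given to the model as context, since they are fictional by construction.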
Peer Reviews
Decision: ICLR 2026 Poster
- Provides clear empirical evidence that RL fine-tuning on synthetic datasets (PhantomWiki, GSM-Infinite) improves LLM multi-hop reasoning on real-world QA benchmarks.
- Addresses a practical problem, the scarcity and cost of high-quality human-annotated data, which the paper suggests can be supplemented or replaced by synthetic reasoning data.
- Demonstrates consistent performance gains across multiple model families and parameter scales, indicating robustness and generalizability.
- The experime
- The paper's novelty is limited, as prior work has already shown that synthetic data and SFT/RLVR work quite well for reasoning. The contribution is primarily a different reasoning setup, namely multi-hop reasoning.
- The domain of synthetic data is narrow, focusing only on arithmetic and relational reasoning, which limits claims of general reasoning transfer.
- The evaluation datasets lack diversity. HotpotQA, 2WikiMultihopQA, and MuSiQue are all two-hop or near two-hop QA tasks, reducing the
The study demonstrates that large language models can acquire generalizable reasoning skills purely from synthetic, knowledge-free data. It provides empirical evidence that these synthetic reasoning abilities transfer to real-world multi-hop QA tasks, achieving substantial performance gains. The approach offers a scalable and cost-effective framework for improving reasoning through verifiable, automatically generated training data.
- The experiments are limited to multi-hop QA. Even though the training data come from a synthetic world, the fact that performance improves on other multi-hop QA benchmarks is not particularly surprising.
- The applicability of the approach to grammatically or semantically complex real-world texts remains unknown.
* I appreciate that the paper presents a tight experimental design around a single, interpretable hypothesis: by training on universes that explicitly do not overlap with real-world knowledge, the study isolates whether multi-hop structure learned in synthetic settings can carry over.
* The paper also tackles an important topic in reasoning/RL, namely the effort to disentangle answer formatting from reasoning.
* I also like how the paper shows curves over checkpoints and stratifying pe
* Although the empirical story is neatly organized, in my view, the novelty is modest given the rapidly expanding literature on synthetic data and RL for reasoning.
* If I remember correctly, PhantomWiki itself was introduced as an on-demand synthetic universe generator to test reasoning and retrieval while sidestepping data leakage; it feels more like this paper leverages that dataset rather than advancing the generation framework. Likewise, GSM-Infinite was created to probe reasoning under c
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
