Learning from Synthetic Data Improves Multi-hop Reasoning

Anmol Kabra; Yilun Yin; Albert Gong; Kamilė Stankevičiūtė; Dongyoung Go; Johann Lee; Katie Z. Luo; Carla P. Gomes; Kilian Q. Weinberger

arXiv:2603.02091·cs.LG·March 3, 2026

TL;DR

This paper demonstrates that reinforcement learning fine-tuning on rule-generated synthetic data improves large language models' multi-hop reasoning, outperforming traditional data sources even though the synthetic data contains only fictional knowledge.

Contribution

It introduces a cost-effective method of using synthetic data for RL fine-tuning to improve reasoning in large language models, showing significant performance gains on real-world benchmarks.

Findings

01 · Synthetic data improves real-world reasoning performance

02 · LLMs learn to compose knowledge from synthetic data

03 · Synthetic data is a scalable resource for reasoning skills

Abstract

Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying…
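To make "rule-generated synthetic data" with verifiable answers concrete, the following is a minimal illustrative sketch, not the paper's generator or that of any dataset it uses. It assumes a toy fictional universe with one hypothetical relation ("mentor"), made-up names, and an exact-match reward; all function and variable names are invented for illustration.

```python
import random

# Hypothetical fictional names; not taken from the paper or its datasets.
NAMES = ["Avel", "Brisa", "Corin", "Dalya", "Emrik", "Fenna"]


def make_universe(seed: int) -> dict[str, str]:
    """Assign each fictional person exactly one mentor, forming a cycle."""
    rng = random.Random(seed)
    people = NAMES[:]
    rng.shuffle(people)
    # Person i is mentored by person i+1 (wrapping around),
    # so every chain of hops has a well-defined answer.
    return {p: people[(i + 1) % len(people)] for i, p in enumerate(people)}


def make_facts(mentor_of: dict[str, str]) -> list[str]:
    """Render the universe as single-hop statements a model could read as context."""
    return [f"The mentor of {p} is {m}." for p, m in sorted(mentor_of.items())]


def make_question(mentor_of: dict[str, str], start: str, hops: int) -> tuple[str, str]:
    """Compose a multi-hop question and derive its answer by following the rule."""
    phrase, answer = start, start
    for _ in range(hops):
        phrase = f"the mentor of {phrase}"
        answer = mentor_of[answer]
    return f"Who is {phrase}?", answer


def reward(prediction: str, answer: str) -> float:
    """Rule-based verifier: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.lower() else 0.0


if __name__ == "__main__":
    universe = make_universe(seed=0)
    question, gold = make_question(universe, start="Avel", hops=2)
    print("\n".join(make_facts(universe)))
    print(question)
    print("reward for the gold answer:", reward(gold, gold))
```

Because the answer is derived from the same rules that build the universe, it can be checked without human annotation or an LLM-based verifier, and the number of hops and the size of the universe can be varied freely.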

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Provides clear empirical evidence that RL fine-tuning on synthetic datasets (PhantomWiki, GSM-Infinite) improves LLM multi-hop reasoning on real-world QA benchmarks.
- Addresses a practical problem: the scarcity and cost of high-quality human-annotated data, which the paper suggests can be supplemented or replaced by synthetic reasoning data.
- Demonstrates consistent performance gains across multiple model families and parameter scales, indicating robustness and generalizability.
- The experime

Weaknesses

- The paper’s novelty is limited, as prior work has already shown that synthetic data with SFT/RLVR works well for reasoning. The contribution is primarily the application to a different reasoning setup, multi-hop reasoning.
- The domain of synthetic data is narrow, focusing only on arithmetic and relational reasoning, which limits claims of general reasoning transfer.
- The evaluation datasets lack diversity. HotpotQA, 2WikiMultihopQA, and MuSiQue are all two-hop or near two-hop QA tasks, reducing the

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The study demonstrates that large language models can acquire generalizable reasoning skills purely from synthetic, knowledge-free data. It provides empirical evidence that these synthetic reasoning abilities transfer to real-world multi-hop QA tasks, achieving substantial performance gains. The approach offers a scalable and cost-effective framework for improving reasoning through verifiable, automatically generated training data.

Weaknesses

- The experiments are limited to multi-hop QA. Even though the training data come from a synthetic world, the fact that performance improves on other multi-hop QA benchmarks is not particularly surprising.
- The applicability of the approach to grammatically or semantically complex real-world texts remains unknown.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* I appreciate that the paper presents a tight experimental design around a single, interpretable hypothesis: by training on universes that are explicitly non-overlapping with real-world knowledge, the study isolates whether multi-hop structure learned in synthetic settings can carry over.
* The paper also tackles an important topic in reasoning/RL, namely the effort to disentangle answer formatting from reasoning.
* I also like how the paper shows curves over checkpoints and stratifying pe

Weaknesses

* Although the empirical story is neatly organized, in my view, the novelty is modest given the rapidly expanding literature on synthetic data and RL for reasoning.
* If I remember correctly, PhantomWiki itself was introduced as an on-demand synthetic universe generator to test reasoning and retrieval while sidestepping data leakage; it feels more like this paper leverages that dataset rather than advancing the generation framework. Likewise, GSM-Infinite was created to probe reasoning under c

Code & Models

Datasets

kilian-group/phantom-reasoning
dataset· 37 dl

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques