SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

Ken Gu; Advait Bhat; Mike A Merrill; Robert West; Xin Liu; Daniel McDuff; Tim Althoff

arXiv:2510.24427·cs.CL·March 11, 2026

TL;DR

SynthWorlds introduces a framework with parallel corpora to disentangle reasoning ability from factual knowledge in language models, enabling more precise evaluation of their true reasoning skills.

Contribution

The paper presents SynthWorlds, a novel, scalable framework that creates parallel worlds to separate reasoning complexity from factual knowledge in language model evaluation.

Findings

01

Models perform better with parametric knowledge, indicating a knowledge advantage gap.

02

Knowledge mechanisms reduce but do not eliminate the performance gap.

03

SynthWorlds enables controlled, scalable evaluation of reasoning versus memorization.
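The knowledge advantage gap named in the findings can be read as a difference in task performance between the two worlds. The sketch below assumes the gap is simply the mean score in the real-mapped world minus the mean score in the synthetic-mapped world; the paper's exact definition is not given on this page. The function name `knowledge_advantage` and the per-question scores are hypothetical, made up for illustration.

```python
def knowledge_advantage(real_scores, synth_scores):
    """Mean real-mapped score minus mean synth-mapped score.

    Assumption: each position holds the score for one mirrored task
    instance, evaluated once in each world. This is an illustrative
    reading of the gap, not the paper's stated formula.
    """
    if len(real_scores) != len(synth_scores):
        raise ValueError("mirrored tasks need one score per world")
    if not real_scores:
        raise ValueError("need at least one task instance")
    real_mean = sum(real_scores) / len(real_scores)
    synth_mean = sum(synth_scores) / len(synth_scores)
    return real_mean - synth_mean


# Made-up correctness flags (1 = correct) for four mirrored questions.
real = [1, 1, 0, 1]
synth = [1, 0, 0, 1]
gap = knowledge_advantage(real, synth)
print(f"knowledge advantage gap: {gap:.2f}")
```

A positive value means the model did better where its parametric knowledge could help; a value near zero would suggest the scores reflect reasoning over the provided corpus alone.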

Abstract

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in…
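The idea of two worlds with identical interconnected structure can be pictured as a consistent renaming of entities in a knowledge graph. The sketch below is a toy illustration under stated assumptions, not the authors' pipeline: it assumes facts are (subject, relation, object) triples, that relations are kept unchanged, and that each entity receives one synthetic name used everywhere. The helper `build_synth_world`, the `Entity_i` naming scheme, and the example triples are all hypothetical.

```python
def build_synth_world(triples, make_name):
    """Rename every entity consistently while keeping the graph structure.

    Assumption: relations are left untouched and only entity names change,
    so any multi-hop path in the real world has a mirrored path here.
    """
    mapping = {}

    def rename(entity):
        # Reuse the same synthetic name each time an entity reappears.
        if entity not in mapping:
            mapping[entity] = make_name(len(mapping))
        return mapping[entity]

    synth = [(rename(s), r, rename(o)) for s, r, o in triples]
    return synth, mapping


# Made-up toy facts standing in for a real-mapped world.
real_triples = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "European Union"),
    ("Seine", "flows_through", "Paris"),
]
synth_triples, mapping = build_synth_world(real_triples, lambda i: f"Entity_{i}")

for real_fact, synth_fact in zip(real_triples, synth_triples):
    print(real_fact, "->", synth_fact)
```

Because the renaming is one-to-one, a question that needs two hops in the real-mapped triples needs the same two hops in the synthetic ones, while the synthetic names give a model nothing to recall.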

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 5

Strengths

1. The introduction is extremely well-written, just exquisite writing. I especially liked lines 59 through 76.
2. For synthetic multi-hop QA data generation to evaluate LMs, the paper presents the right next step by building synthetic datasets on top of the real-world Wikipedia knowledge graph that captures the complex interconnectedness and messiness. I think this paper is exciting for the field of multi-hop reasoning evaluation.
3. SynthWorlds contains a set of reasoning motifs, e.g., constrai

Weaknesses

1. The authors talk heavily about determining knowledge gaps as the main contribution, but also name task reasoning difficulty as a main contribution in contribution 1 and in Sec 3, line 205. It is clear how SynthWorlds evaluates knowledge gaps, but not clear at all how it determines task reasoning difficulty. It would be good to discuss this, since it's a main contribution of the work.
2. A major limitation is that SynthWorlds requires a knowledge graph to exist for a document corpus (Wikidata gra

Reviewer 02·Rating 6·Confidence 4

Strengths

- SYNTHWORLDS uniquely separates reasoning complexity from parametric knowledge through parallel corpora.
- The framework is fully automatic and scalable, leveraging knowledge graphs (e.g., Wikidata) to generate large, interconnected corpora without manual curation.

Weaknesses

- This paper raises the challenge of distinguishing reasoning from reciting for controlled evaluation. However, we don't know whether the paper really solves this problem. I mean, if your scores can precisely reflect the real reasoning abilities of LLMs, then you should observe a correlation between human preference (e.g. LM Arena Rankings on Reasoning) and your scores.
- This paper mentions two kinds of previous approaches to controlled evaluation: (1) curation of “clean” evaluation sets and (2)

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper offers a novel and well-motivated formulation of disentangling reasoning from memorized factual knowledge by constructing paired “real-mapped” and “synth-mapped” worlds with matched structure, and by introducing a formal knowledge-advantage metric to quantify the contribution of parametric knowledge.
2. The data-generation pipeline is fully automated; difficulty is explicitly controlled; the evaluations span parallel multi-hop QA and page navigation tasks; and the comparisons cover

Weaknesses

1. Conclusions are based on two proprietary models; it is unclear how KA scales with capacity or how post-training techniques affect KA.
2. The corpus and questions, derived largely from one source (Wikidata), may yield relation distributions and a writing style that favor certain generalization paths, which may distort the evaluation.
3. Although the benchmark proposed in the paper can quantify KA, the quantification results do not seem surprising. Moreover, it is not yet shown that improvements on Sy

Code & Models

Datasets

kenqgu/SynthWorlds
dataset· 160 dl
