SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff

TL;DR
SynthWorlds introduces a framework with parallel corpora to disentangle reasoning ability from factual knowledge in language models, enabling more precise evaluation of their true reasoning skills.
Contribution
The paper presents SynthWorlds, a novel, scalable framework that creates parallel worlds to separate reasoning complexity from factual knowledge in language model evaluation.
Findings
Models perform better in the real-mapped world, where they can exploit parametric knowledge, than in the synthetic-mapped world; this difference is the knowledge advantage gap.
Knowledge acquisition and integration mechanisms reduce but do not eliminate the knowledge advantage gap.
SynthWorlds enables controlled, scalable evaluation of reasoning versus memorization.
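As an illustration of the knowledge advantage gap named in the findings above, here is a minimal sketch. It assumes the gap is the mean score on the real-mapped world minus the mean score on the paired synthetic-mapped world; the paper's formal metric is not reproduced on this page. The function name `knowledge_advantage` and the scores are hypothetical and are not results from the paper.

```python
from statistics import mean


def knowledge_advantage(real_scores, synth_scores):
    """Mean real-mapped score minus mean synthetic-mapped score.

    Assumption: both lists hold scores for the same mirrored tasks,
    one entry per task, in the same order.
    """
    if len(real_scores) != len(synth_scores):
        raise ValueError("expected one score per world for each paired task")
    return mean(real_scores) - mean(synth_scores)


# Hypothetical per-question correctness (1 = correct, 0 = wrong).
real_mapped = [1, 1, 0, 1]
synth_mapped = [1, 0, 0, 1]

gap = knowledge_advantage(real_mapped, synth_mapped)
print(f"knowledge advantage gap: {gap:.2f}")
```

A positive value means the model scored higher where its parametric knowledge applies; a value near zero would suggest performance is driven by reasoning over the provided documents alone.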
Abstract
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in…
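The idea of two worlds with identical structure can be sketched on a toy knowledge graph. This is an illustrative sketch under stated assumptions, not the authors' pipeline: it assumes the synthetic-mapped world is obtained by consistently renaming entities while leaving relations and edges unchanged. The helper names, the relation labels, and the synthetic names are all hypothetical.

```python
def build_synthetic_world(triples, name_map):
    """Rename every entity consistently; relations and edges are unchanged."""
    return [(name_map[s], r, name_map[o]) for s, r, o in triples]


def two_hop(triples, start, first_relation, second_relation):
    """Answer a two-hop question by following two relations from `start`."""
    middle = {o for s, r, o in triples if s == start and r == first_relation}
    return sorted(
        o for s, r, o in triples if s in middle and r == second_relation
    )


# Toy real-mapped graph (hypothetical relation labels).
real_world = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "European Union"),
]

# Hypothetical synthetic names that carry no parametric knowledge.
name_map = {
    "Paris": "Velmora",
    "France": "Tarsk",
    "European Union": "Oru League",
}

synth_world = build_synthetic_world(real_world, name_map)

real_answer = two_hop(real_world, "Paris", "capital_of", "member_of")
synth_answer = two_hop(synth_world, "Velmora", "capital_of", "member_of")
print(real_answer, synth_answer)
```

Both questions require the same two hops, so the reasoning difficulty is matched; only the real-mapped question can be answered from memorized facts.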
Peer Reviews
Decision·ICLR 2026 Poster
1. The introduction is extremely well-written, just exquisite writing. I especially liked lines 59 through 76. 2. For synthetic multi-hop QA data generation to evaluate LMs, the paper presents the right next step by building synthetic datasets on top of the real-world Wikipedia knowledge graph that captures the complex interconnectedness and messiness. I think this paper is exciting for the field of multi-hop reasoning evaluation. 3. SynthWorlds contains a set of reasoning motifs, e.g., constrai
1. The authors heavily talk about determining knowledge gaps as the main contribution, but mention task reasoning difficulty in contribution 1 and Sec 3 line 205 as a main contribution. It is clear how SynthWorlds is evaluating knowledge gaps, but not clear at all how to determine task reasoning difficulty. It would be good to discuss this, since it's a main contribution of the work. 2. A major limitation is that SynthWorlds requires a knowledge graph to exist for a document corpus (Wikidata gra
- SYNTHWORLDS uniquely separates reasoning complexity from parametric knowledge through parallel corpora. - The framework is fully automatic and scalable, leveraging knowledge graphs (e.g., Wikidata) to generate large, interconnected corpora without manual curation.
- This paper raises the challenge of distinguishing reasoning from reciting for controlled evaluation. However, it is unclear whether the paper really solves this problem. I mean, if your scores can precisely reflect the real reasoning abilities of LLMs, then you should observe a correlation between human preference (e.g. LM Arena Rankings on Reasoning) and your scores. - This paper mentions two kinds of previous approaches to controlled evaluation: (1) curation of “clean” evaluation sets and (2)
1. The paper offers a novel and well-motivated formulation of disentangling reasoning from memorized factual knowledge by constructing paired “real-mapped” and “synth-mapped” worlds with matched structure, and by introducing a formal knowledge-advantage metric to quantify the contribution of parametric knowledge. 2. The data-generation pipeline is fully automated; difficulty is explicitly controlled; the evaluations span parallel multi-hop QA and page navigation tasks; and the comparisons cover
1. Conclusions are based on two proprietary models; it is unclear how KA scales with capacity or how post-training techniques affect KA. 2. The corpus and questions derived largely from one source (Wikidata) may yield relation distributions and writing style that favor certain generalization paths, which may disrupt the evaluation. 3. Although the benchmark proposed in the paper can quantify KA, the quantification results do not seem surprising. Moreover, it is not yet shown that improvements on Sy
