Tiny Moves: Game-based Hypothesis Refinement
Agnieszka Dobrowolska, Rogier Hintzen, Martin Balla, Karl Gemayel, Sabine Reichert, Thomas Charman, Jen Ning Lim, Lindsay Edwards, Anna Gogleva

TL;DR
This paper introduces The Hypothesis Game, a symbolic framework enabling LLM agents to iteratively refine scientific hypotheses through localized, domain-grounded moves, improving error correction and interpretability.
Contribution
It presents a novel game-based formalism for hypothesis refinement that emphasizes incremental reasoning, demonstrating its effectiveness over traditional prompting methods in scientific discovery tasks.
Findings
Game-based approach removes more errors than baselines in corruption recovery.
Achieves higher precision while preserving valid hypothesis structure.
Performs comparably to top baselines in partial cue reconstruction tasks.
Abstract
Most machine learning approaches to scientific discovery frame hypotheses as end-to-end predictions, obscuring the incremental structure of scientific reasoning. We propose The Hypothesis Game, a symbolic formalism for hypothesis refinement in which LLM agents operate on a shared hypothesis state using a fixed grammar of reasoning moves. The framework is motivated by the observation that scientific progress often proceeds through small, localized revisions, grounded in domain context, rather than extensive rewrites. We instantiate a minimal game with LLM agents and evaluate it on pathway-level mechanistic refinement tasks. In the primary setting of corruption recovery, where hypotheses contain controlled errors, the game-based approach consistently removes more errors and achieves higher precision than strong prompting baselines, while preserving valid structure through incremental…
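As a rough illustration of the loop the abstract describes, the sketch below applies small, localized moves to a shared hypothesis state under a move budget. It is not the authors' implementation. The edge-set representation, the toy `KNOWLEDGE` set, the `State` class, the `play` function, and the fixed move schedule are all assumptions made for illustration. The move names `retrieve`, `prune`, and `expand` follow the move set listed in the reviews; the debate move and the LLM-driven controller are omitted because they depend on model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # directed edge: (upstream, downstream)


@dataclass(frozen=True)
class State:
    """Shared state: the current hypothesis plus evidence retrieved so far."""
    hypothesis: FrozenSet[Edge]
    evidence: FrozenSet[Edge] = frozenset()


# Toy stand-in for a domain knowledge source (hypothetical, not real pathway data).
KNOWLEDGE: FrozenSet[Edge] = frozenset({("A", "B"), ("B", "C"), ("C", "D")})


def retrieve(s: State) -> State:
    """Collect known edges that touch any entity already in the hypothesis."""
    nodes = {n for edge in s.hypothesis for n in edge}
    found = frozenset(e for e in KNOWLEDGE if e[0] in nodes or e[1] in nodes)
    return State(s.hypothesis, s.evidence | found)


def prune(s: State) -> State:
    """Remove at most one edge that the retrieved evidence does not support."""
    if not s.evidence:  # nothing retrieved yet, so make no judgment
        return s
    unsupported = sorted(s.hypothesis - s.evidence)
    if not unsupported:
        return s
    return State(s.hypothesis - {unsupported[0]}, s.evidence)


def expand(s: State) -> State:
    """Add at most one evidence-backed edge missing from the hypothesis."""
    missing = sorted(s.evidence - s.hypothesis)
    if not missing:
        return s
    return State(s.hypothesis | {missing[0]}, s.evidence)


MOVES: Dict[str, Callable[[State], State]] = {
    "retrieve": retrieve,
    "prune": prune,
    "expand": expand,
}


def play(start: State, schedule: List[str], budget: int):
    """Apply up to `budget` moves in order and record a step-by-step trace."""
    state = start
    trace = []
    for name in schedule[:budget]:
        state = MOVES[name](state)
        trace.append((name, state.hypothesis))
    return state, trace


# A hypothesis with one injected error, ("B", "X"), in the spirit of corruption recovery.
corrupted = State(frozenset({("A", "B"), ("B", "C"), ("B", "X")}))
final, trace = play(corrupted, ["retrieve", "prune", "expand"], budget=3)
```

In this toy run each move changes at most one edge, so the trace shows exactly which step removed the injected edge and which step added the missing one. A fixed schedule is used here only to keep the example deterministic.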
Peer Reviews
Decision · Submitted to ICLR 2026
(1) The paper proposes a novel framing of scientific reasoning as a game with explicit, interpretable moves, offering an original perspective on hypothesis refinement. (2) The paper is well organized and clearly written, making a complex concept easy to follow.
* The scope is limited to biological pathways; it is unclear how the framework performs in other scientific domains.
* The controller is still prompt-based, not a learned or metric-driven system, which makes the implementation less autonomous.
* The move set is small (only four operations), which may limit creativity and open-ended discovery.
* The evaluation relies partly on LLM judgments and does not include human or symbolic verification.
1) The paper provides a principled and interpretable alternative to free-form LLM reasoning. 2) The reasoning grammar is well motivated and modular, allowing transparent control of reasoning styles and easy extensibility. 3) The experimental setup is carefully designed, including both reconstruction and corruption tasks on Reactome pathways.
In my opinion, the main weakness of the paper is that the scoring formalism proposed in Section 2.4 remains theoretical and untested. Other than that: 1) The evaluation lacks any assessment of statistical significance across multiple runs. This limits confidence in the reported performance differences. 2) Biological validation of the refined hypotheses is absent. While metrics are informative, there is no verification that reconstructed or corrected pathways remain biologically plausible or mech
- The paper formalizes hypothesis refinement as iterated transformations under a move budget, providing a crisp, compositional semantics. This clarity aids reuse and analysis.
- A small set of moves (prune, expand, retrieve, debate) is specified and mapped to agent responsibilities. This fosters generality while remaining practical. The grammar supports composition and repeated application, encouraging modular reasoning and stepwise traceability. Conceptual visualization (Fig. 1) improves read
- Scope of baselines and diagnostics is narrow. Baselines are limited to prompting (Zero-Shot, Chain-of-Thought, ReAct), omitting other structured controllers or graph-based planners, which constrains external validity. No ablation on the move grammar (e.g., removing debate or retrieve) or on the move budget to isolate which components drive gains.
- Formal elements are not fully operationalized. Modes \(\pi_M\) are implemented via prompt text rather than as an explicit policy; scoring \(S(H_t)
Taxonomy
Topics: Topic Modeling · Scientific Computing and Data Management · Explainable Artificial Intelligence (XAI)
