Case-Guided Sequential Assay Planning in Drug Discovery
Tianchi Chen, Jan Bima, Sean L. Wu, Otto Ritter, Bingjia Yang, and Xiang Yu

TL;DR
This paper introduces IBMDP, a model-based reinforcement learning framework that efficiently plans sequential assays in drug discovery using historical data, significantly reducing resource use while maintaining decision quality.
Contribution
IBMDP is a novel Bayesian model-based RL approach that constructs implicit transition models from historical data and employs ensemble MCTS for resource-efficient assay planning.
Findings
IBMDP reduced resource consumption by up to 92% in real-world drug discovery tasks.
IBMDP outperformed heuristic methods in decision confidence and efficiency.
In synthetic benchmarks, IBMDP closely matched the optimal policy, outperforming deterministic value iteration.
Abstract
Optimally sequencing experimental assays in drug discovery is a high-stakes planning problem under severe uncertainty and resource constraints. A primary obstacle for standard reinforcement learning (RL) is the absence of an explicit environment simulator or transition data; planning must rely solely on a static database of historical outcomes. We introduce the Implicit Bayesian Markov Decision Process (IBMDP), a model-based RL framework designed for such simulator-free settings. IBMDP constructs a case-guided implicit model of transition dynamics by forming a nonparametric belief distribution using similar historical outcomes. This mechanism enables Bayesian belief updating as evidence accumulates and employs ensemble MCTS planning to generate stable policies that balance information gain toward desired outcomes with resource efficiency. We validate IBMDP through…
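The abstract's idea of a nonparametric belief over similar historical outcomes can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, and the function names `similarity_weights` and `sample_outcome` are assumptions chosen for illustration, and the paper's actual similarity function may differ.

```python
import numpy as np

def similarity_weights(history, observed, bandwidth=1.0):
    """Belief over historical cases given the assays observed so far.

    history  : (n_cases, n_assays) array of historical assay outcomes
    observed : dict {assay_index: measured_value} for the current compound
    Assumption: a Gaussian kernel on squared distance (illustrative only).
    """
    n_cases = history.shape[0]
    if not observed:
        # No evidence yet: uniform belief over all historical cases.
        return np.full(n_cases, 1.0 / n_cases)
    idx = np.array(sorted(observed))
    vals = np.array([observed[i] for i in idx], dtype=float)
    sq_dist = ((history[:, idx] - vals) ** 2).sum(axis=1)
    log_w = -sq_dist / (2.0 * bandwidth ** 2)
    log_w -= log_w.max()  # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def sample_outcome(history, observed, assay, rng, bandwidth=1.0):
    """Sample a plausible result for an unperformed assay by drawing a
    historical case in proportion to its similarity weight."""
    w = similarity_weights(history, observed, bandwidth)
    case = rng.choice(history.shape[0], p=w)
    return history[case, assay]

# Toy usage: three historical compounds, two assays (made-up numbers).
history = np.array([[0.0, 1.0], [0.0, 1.0], [5.0, 9.0]])
rng = np.random.default_rng(0)
belief = similarity_weights(history, {0: 0.0})
simulated = sample_outcome(history, {0: 0.0}, assay=1, rng=rng)
```

With this particular kernel, adding one more observed assay multiplies each case's weight by a per-assay likelihood term and renormalizes, which is the shape of a Bayesian belief update over historical cases. A planner such as MCTS could call `sample_outcome` in place of a simulator when expanding a candidate assay.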
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The paper clearly identifies a gap of planning without simulators and formalizes a principled approach via implicit dynamics. The appendix convincingly reinterprets the similarity mechanism as Bayesian belief updating in a POMDP where the latent variable indexes historical prototypes. This elevates what could be seen as heuristic into a grounded probabilistic framework. The approach bridges case-based reasoning, kernel RL, and Bayesian experimental design, offering a coherent hybrid that is both
- There is a lack of comparison with existing causal Bayesian optimization approaches and the authors do not cite relevant work such as Durand et al. 2025 https://arxiv.org/pdf/2503.19554 and other CBO works.
- The approach is highly dependent on historical coverage: that is, it can only sample from observed compound profiles. This means it cannot generalize beyond the chemical or assay distribution of the historical dataset. This is acknowledged but severely limits applicability in novel disco
- The paper addresses an important problem: decision-making without simulators is practically useful, and the paper is well-motivated.
- The proposed method provides similarity-weighted sampling that is intuitive and computationally tractable.
- The paper is generally well-written.
- The empirical evaluation of IBMDP includes both a real-world drug discovery task and a synthetic benchmark.
- For the real-world drug discovery task, the baselines are insufficient. Comparisons are performed only against rule-based heuristics; adding baselines such as kNN-Thompson alternatives compatible with the same posterior predictive and constraints would strengthen the evidence for IBMDP's effectiveness.
- Theoretical analysis:
  - No convergence guarantees or regret bounds are provided. Unlike other Bayesian RL methods with proven regret bounds, IBMDP offers only empirical robustness.
  - The provided consistency proof is weak. Th
1. This paper studies an important but underexplored problem, sequential experimental design when only static historical data exists, without simulators or explicit transition tuples, and it is well-motivated by real pharmaceutical constraints where mechanistic models are unavailable.
2. The authors provide theoretical analysis by formalizing the proposed approach as a POMDP and proving that similarity weight updates implement exact Bayesian belief updates.
3. The authors conducted experiments
1. The proposed method depends heavily on the quality of the historical dataset $\mathcal{D}$. Unlike model-free RL, it cannot discover strategies that are not present in the historical data, so any gaps or biases in the data can lead to suboptimal decisions. It would be better if the authors could discuss whether this is a valid concern and how to address it.
2. The authors' primary claim of practical utility is on the real-world case study, but the only baseline used is a rule-based decision s
- Originality and significance: This reviewer believes these areas are lacking (see weaknesses). In particular, the generalizability of the proposed method to other domains or other experiments seems lacking unless the proposed weight function can be justified through prior literature or strong empirical performance with the real-world dataset.
- Quality: Experiments could be improved.
- Clarity: The current presentation of the ideas in the methods section is clear. Some connections to related wo
# Method:
- Lacking justification for the assumption about the similarity weights: It is unclear (at least, the current manuscript lacks justification from the relevant field) that similar molecular structures behave similarly in terms of the target property, and whether the proposed similarity function accurately captures this relationship. This requires either grounding in the molecular structure literature or rigorous justification with real-world experiments, which the current paper does not p
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research
