Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning
Pedro P. Santos, Alberto Sardinha, Francisco S. Melo

TL;DR
This paper introduces a method for solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial setting. It uses online planning with Monte-Carlo tree search and outperforms relevant baselines in experiments.
Contribution
It is the first approach to solve infinite-horizon discounted GUMDPs in the single-trial regime, combining theoretical results on policy optimization with an online planning algorithm.
Findings
The approach outperforms relevant baselines in experiments.
It establishes fundamental results on policy optimality in the single-trial regime.
It demonstrates the effectiveness of Monte-Carlo tree search for GUMDPs.
Abstract
In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
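To make the single-trial distinction concrete, here is a minimal sketch, not taken from the paper's experiments. It assumes a hypothetical two-state problem in which the agent starts in state 0 or state 1 with equal probability and stays there, and a hypothetical quadratic cost that penalizes distance from a uniform target occupancy. The names `empirical_occupancy` and `cost` are illustrative only. The single-trial objective averages the cost of each trajectory's own discounted occupancy, whereas the infinite-trial objective applies the cost to the averaged occupancy.

```python
def empirical_occupancy(states, n_states, gamma):
    """Discounted empirical state occupancy of one truncated trajectory."""
    d = [0.0] * n_states
    for t, s in enumerate(states):
        d[s] += (1.0 - gamma) * gamma ** t
    return d


def cost(d, target):
    """Hypothetical non-linear objective: squared distance to a target occupancy."""
    return sum((di - ti) ** 2 for di, ti in zip(d, target))


gamma, horizon, n_states = 0.9, 200, 2
target = [0.5, 0.5]

# Assumed toy dynamics: each trajectory stays in its start state forever,
# and the two start states are equally likely.
trajectories = [[0] * horizon, [1] * horizon]
occupancies = [empirical_occupancy(tr, n_states, gamma) for tr in trajectories]

# Single-trial objective: expected cost of the per-trajectory occupancy.
single_trial = sum(cost(d, target) for d in occupancies) / len(occupancies)

# Infinite-trial objective: cost of the expected occupancy.
mean_occupancy = [sum(col) / len(col) for col in zip(*occupancies)]
infinite_trial = cost(mean_occupancy, target)

print(f"single-trial cost:   {single_trial:.4f}")
print(f"infinite-trial cost: {infinite_trial:.4f}")
```

In this toy case the averaged occupancy matches the target almost exactly, while every individual trajectory is far from it, so the two objectives disagree. Because the cost is non-linear in the occupancy, the two quantities need not coincide in general.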
Peer Reviews
Decision: ICLR 2026 Poster
The paper provides some fundamental theoretical results comparing stationary, Markovian, and non-Markovian policies for the single-trial regime. Moreover, the paper characterizes the connection between the original GUMDPs and their truncation to finite-horizon MDPs in terms of regret. The paper also proves that the GUMDPs with non-Markovian policies are equivalent to occupancy MDPs with stationary policies. These theoretical findings are novel and provide interesting insights for the single-tria
Although the paper empirically tests the computational performance of its MCTS-based algorithmic framework, it lacks a theoretical analysis of that framework. The NP-hardness of computing an optimal policy should not prevent us from deriving a finite convergence guarantee to an optimal policy ($1/\epsilon$ versus $\log(1/\epsilon)$). However, the paper provides neither a regret bound nor a convergence guarantee.
The paper clearly delineates the single-trial from the infinite-trial formulation and establishes a solid method to reformulate the GUMDP as an occupancy MDP that can be solved with well established planning methods. The paper shows clear improvement over previous methods, albeit on simple domains. Overall, the authors identify a subtle problem when considering alternate MDPs with non-linear objectives.
It is not entirely clear what the main theoretical contribution of the work is, since the single-trial setting is a special case of the multiple-trial setting from Santos et al. (ICML 2025). Empirically, the method was only shown for very small state and action spaces where even a random policy achieves reasonable performance. It should ideally be further tested on more complicated and larger environments, and particularly on a specific use case where single-trial evaluation is required by it
The writing quality and presentation of this paper are outstanding. In particular, the analysis is well-motivated and presented in a clear and easy-to-read manner. The empirical results are encouraging.
I have no major concerns with this paper, and it is more or less in a publishable state. Here are some minor concerns/suggestions for the authors:
- The notation related to the occupancy measure is a bit confusing. In particular, the notation of d_pi in equation 1 vs d_^pi in equation 3 onwards makes it unclear how these two terms are related.
- Lines 173-174: the claim that 'practical applications often require identifying a policy that performs optimally when evaluated based on a single tra
Taxonomy
Topics: Software Reliability and Analysis Research
Methods: Monte-Carlo Tree Search
