Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

Pedro P. Santos; Alberto Sardinha; Francisco S. Melo

arXiv:2505.15782·cs.LG·February 10, 2026

TL;DR

This paper introduces a novel method for solving infinite-horizon discounted general-utility Markov decision processes in the single-trial setting. The method uses online planning, specifically Monte-Carlo tree search, and outperforms relevant baselines in the reported experiments.

Contribution

The paper is the first to address general-utility Markov decision processes (GUMDPs) in the single-trial regime, combining theoretical results on policy optimization with an online planning approach.

Findings

01

The approach outperforms relevant baselines in experiments.

02

The paper establishes fundamental results on policy optimality in the single-trial regime.

03

Monte-Carlo tree search is shown to be effective for solving GUMDPs in the single-trial regime.

Abstract

In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
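To make the single-trial distinction concrete, the following is a minimal sketch, not the authors' code. It assumes the usual discounted state-action occupancy, with the empirical occupancy of one trajectory defined as (1 - gamma) * sum_t gamma^t * 1[s_t = s, a_t = a], and it uses a hypothetical convex objective (the squared norm of the occupancy) chosen only for illustration. The function names and the toy two-state setup are assumptions. It shows that evaluating the objective per trajectory and then averaging can differ from evaluating it on the averaged occupancy.

```python
import numpy as np

def empirical_occupancy(trajectory, n_states, n_actions, gamma):
    """Discounted empirical occupancy of a single (truncated) trajectory.

    trajectory: list of (state, action) index pairs.
    """
    d = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(trajectory):
        d[s, a] += (1.0 - gamma) * gamma**t
    return d

def squared_norm(d):
    """Hypothetical convex objective f(d) = sum of squared entries."""
    return float(np.sum(d**2))

gamma, horizon = 0.9, 200
# Toy setup (assumed): two states, one action, and two equally likely
# trajectories, one that stays in state 0 and one that stays in state 1.
traj_a = [(0, 0)] * horizon
traj_b = [(1, 0)] * horizon
d_a = empirical_occupancy(traj_a, 2, 1, gamma)
d_b = empirical_occupancy(traj_b, 2, 1, gamma)

# Single-trial style evaluation: average of f over individual trajectories.
single_trial = 0.5 * squared_norm(d_a) + 0.5 * squared_norm(d_b)
# Infinite-trial style evaluation: f applied to the expected occupancy.
infinite_trial = squared_norm(0.5 * d_a + 0.5 * d_b)

print(single_trial, infinite_trial)
```

For a convex objective, Jensen's inequality implies the second quantity is never larger than the first, so a policy that looks good on the expected occupancy may look worse when judged on one trajectory.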

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 2

Strengths

The paper provides some fundamental theoretical results comparing stationary, Markovian, and non-Markovian policies for the single-trial regime. Moreover, the paper characterizes the connection between the original GUMDPs and their truncation to finite-horizon MDPs in terms of regret. The paper also proves that the GUMDPs with non-Markovian policies are equivalent to occupancy MDPs with stationary policies. These theoretical findings are novel and provide interesting insights for the single-tria

Weaknesses

Although the paper empirically tests the computational performance of its MCTS-based algorithmic framework, it lacks a theoretical analysis of that framework. The NP-hardness of computing an optimal policy should not prevent us from deriving a finite convergence guarantee to an optimal policy ($1/\epsilon$ versus $\log(1/\epsilon)$). However, there is neither a regret bound nor a convergence guarantee.

Reviewer 02·Rating 4·Confidence 3

Strengths

The paper clearly delineates the single-trial from the infinite-trial formulation and establishes a solid method to reformulate the GUMDP as an occupancy MDP that can be solved with well established planning methods. The paper shows clear improvement over previous methods, albeit on simple domains. Overall, the authors identify a subtle problem when considering alternate MDPs with non-linear objectives.
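The occupancy-MDP reformulation mentioned above can be pictured with a small sketch. This is an illustrative assumption about the construction, not the paper's definition: the augmented state pairs the current environment state with the running discounted occupancy accumulated so far, so that a policy over augmented states can depend on the history through that occupancy. The names `initial_state` and `augmented_step`, and the explicit time index, are hypothetical.

```python
import numpy as np

def initial_state(s0, n_states, n_actions):
    """Augmented state: (environment state, time step, running occupancy)."""
    return (s0, 0, np.zeros((n_states, n_actions)))

def augmented_step(aug, action, next_state, gamma):
    """Advance the augmented state after taking `action` and observing
    `next_state` from the underlying environment (assumed given)."""
    s, t, occ = aug
    new_occ = occ.copy()
    # Add the discounted visit of the (state, action) pair just executed.
    new_occ[s, action] += (1.0 - gamma) * gamma**t
    return (next_state, t + 1, new_occ)

# Usage: two states, two actions, gamma = 0.5.
aug = initial_state(0, n_states=2, n_actions=2)
aug = augmented_step(aug, action=1, next_state=1, gamma=0.5)
aug = augmented_step(aug, action=0, next_state=0, gamma=0.5)
state, t, occ = aug
print(state, t, occ)
```

Under this sketch, a planner such as Monte-Carlo tree search would search over augmented states and score a rollout by applying the objective to the accumulated occupancy; how the paper handles truncation and scoring is not shown here.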

Weaknesses

It is not entirely clear what the main theoretical contribution of the work is, since the single-trial setting is a special case of the multiple-trial setting from Santos et al. (ICML 2025). Empirically, the method was only shown for very small state and action spaces where even a random policy achieves reasonable performance. It should ideally be further tested on more complicated and larger environments, and particularly on a specific use case where single-trial evaluation is required by it

Reviewer 03·Rating 8·Confidence 3

Strengths

The writing quality and presentation of this paper are outstanding. In particular, the analysis performed is well-motivated and presented in a clear and easy-to-read manner. The empirical results are encouraging.

Weaknesses

I have no major concerns with this paper, and it is more or less in a publishable state. Here are some minor concerns/suggestions for the authors:
- The notation related to the occupancy measure is a bit confusing. In particular, the notation of d_pi in equation 1 vs d_^pi in equation 3 onwards makes it unclear how these two terms are related.
- Lines 173-174: the claim that 'practical applications often require identifying a policy that performs optimally when evaluated based on a single tra

Taxonomy

Topics·Software Reliability and Analysis Research

Methods·Monte-Carlo Tree Search