Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?

Hao Liang; Jiayu Cheng; Sean R. Sinclair; Yali Du

arXiv:2601.20694·cs.LG·January 29, 2026

Open Access 3 Reviews

TL;DR

This paper demonstrates that in exogenous MDPs, pure exploitation algorithms can achieve strong regret bounds without explicit exploration, challenging the traditional belief that exploration is necessary.

Contribution

The paper introduces PEL and LSVI-PE algorithms with the first finite-sample regret guarantees for exploitation-only methods in Exo-MDPs, using novel analytical tools.

Findings

01

PEL achieves $\tilde{O}(H^2|\Xi|\sqrt{K})$ regret in the tabular case

02

LSVI-PE handles large continuous spaces with polynomial regret

03

Exploitation-only methods outperform baselines in experiments

Abstract

Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from exogenous inputs that evolve independently of the learner's actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions. We show that exploration is unnecessary. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, PEL achieves $\tilde{O}(H^2|\Xi|\sqrt{K})$. For large,…
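The exploitation-only idea in the abstract can be illustrated with a minimal sketch. This is not the paper's PEL algorithm; it is a toy lost-sales inventory problem built on several assumptions that are not in the source: demand is i.i.d. over a small finite support, demand is fully observed after every step, and the endogenous dynamics and rewards (the hypothetical `price`, `cost`, and `cap` parameters) are known. Each episode, the learner plugs its empirical demand estimate into backward induction and acts greedily, with no bonus or randomization. Because the observed demand does not depend on the chosen order quantity, the estimate improves regardless of which actions are taken.

```python
import numpy as np

def plan_greedy(p_hat, H, cap, price=2.0, cost=1.0):
    """Backward induction under an estimated demand distribution p_hat.

    States are inventory levels 0..cap and actions are order quantities.
    The dynamics and reward are a hypothetical toy model, assumed known.
    """
    demands = np.arange(len(p_hat))
    V = np.zeros((H + 1, cap + 1))
    policy = np.zeros((H, cap + 1), dtype=int)
    for h in range(H - 1, -1, -1):
        for s in range(cap + 1):
            best_q, best_a = -np.inf, 0
            for a in range(cap - s + 1):
                stock = s + a
                sales = np.minimum(stock, demands)   # lost-sales model
                nxt = stock - sales                  # next inventory level
                q = np.dot(p_hat, price * sales - cost * a + V[h + 1, nxt])
                if q > best_q:
                    best_q, best_a = q, a
            V[h, s] = best_q
            policy[h, s] = best_a
    return policy, V

def run_pure_exploitation(true_p, K=200, H=5, cap=4, seed=0):
    """Run K episodes of estimate-then-act-greedily; return the final estimate."""
    rng = np.random.default_rng(seed)
    counts = np.ones(len(true_p))  # uniform pseudo-counts (an assumption)
    for _ in range(K):
        p_hat = counts / counts.sum()
        policy, _ = plan_greedy(p_hat, H, cap)
        s = 0
        for h in range(H):
            a = policy[h, s]
            d = rng.choice(len(true_p), p=true_p)
            counts[d] += 1               # demand is observed whatever a was
            s = s + a - min(s + a, d)    # known endogenous transition
    return counts / counts.sum()
```

Calling `run_pure_exploitation(np.array([0.2, 0.5, 0.3]))` returns an empirical demand estimate learned purely from greedy play. The sketch does not measure regret and says nothing about the rates proved in the paper.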

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

They present the first near-optimal regret bound for Exo-MDPs under linear function approximation and theoretically establish that it is independent of the endogenous state and action cardinalities. Furthermore, they rigorously prove that the exogenous process in Exo-MDPs evolves independently of the policy, thereby removing the need for explicit exploration. This is supported by $\tilde{\mathcal{O}}(\sqrt{K})$ regret guarantees in both the tabular and linear function approximation settings.

Weaknesses

1. The modelling assumptions are somewhat restrictive, as the theoretical results rely on the exogenous state space being discrete and on the endogenous transition and reward functions being known.

2. The regret bound scales linearly with $|\Xi|$ in both the tabular and linear function approximation settings, which may limit scalability. In particular, LSVI-PE can exhibit degraded performance when the anchor placement is suboptimal and $\lambda_0$ becomes small.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper makes a solid contribution to the RL literature. Understanding exactly when greedy exploitation is sufficient for provable guarantees is a topic that has gotten more attention lately.

- This tends to occur when there is sufficient environment noise, and Exo-MDPs appear to be one such case. To my knowledge, this is novel.

- The authors tackle bandits and both tabular and linear MDPs, showing that this holds somewhat more broadly than just in an isolated case.

- The paper is largely

Weaknesses

1. Not much intuition is provided on exactly why such a positive result is possible. At a very high level, which probably skims over many subtleties, it seems to be because the learner can decouple the exogenous transitions, which require no exploration to learn, from the endogenous transitions, which are deterministic and known. Assuming that one can do so, the whole problem then reduces to learning the exogenous transitions for input to learning a Q-function vi

Reviewer 03 · Rating 4 · Confidence 3

Strengths

I think that the question tackled here is interesting, as several problems of practical interest are usually solved by algorithms that do not use any exploration mechanism.

Weaknesses

I think that the assumption about the existence of the anchor set is not strong; however, I do think that it is a strong assumption that the learning algorithm knows this set. Why is it not possible to guarantee the invertibility of $\Sigma$ by defining $\Sigma = \sum_{n=1}^{N} \phi_n \phi_n^\top + \beta I$, where $\beta$ is a small scalar and $I$ is the identity matrix? In this way, by switching from least squares to ridge regression, it should be possible to avoid the assumption. Why do you use the anc
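The reviewer's suggestion can be made concrete with a small numerical sketch. It only illustrates the linear-algebra point that adding $\beta I$ makes the Gram matrix invertible. It does not show whether the paper's regret analysis survives the change, and the feature matrix, targets, and function names below are hypothetical.

```python
import numpy as np

def gram(features, beta=0.0):
    """Return Sigma = sum_n phi_n phi_n^T + beta * I for row-stacked features."""
    d = features.shape[1]
    return features.T @ features + beta * np.eye(d)

def ridge_fit(features, targets, beta):
    """Solve the ridge normal equations (Sigma + beta I) theta = Phi^T y."""
    return np.linalg.solve(gram(features, beta), features.T @ targets)

# Hypothetical rank-deficient features: none of them uses the third coordinate,
# so the unregularized Gram matrix is singular.
phi = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [1.0, 1.0, 0.0]])
y = np.array([1.0, 2.0, 3.0])

rank_plain = np.linalg.matrix_rank(gram(phi))             # less than d = 3
min_eig_ridge = np.linalg.eigvalsh(gram(phi, 1e-3)).min() # at least beta
theta = ridge_fit(phi, y, beta=1e-3)
```

With $\beta > 0$ every eigenvalue of $\Sigma$ is at least $\beta$, so the solve always succeeds. The price is a small shrinkage bias in `theta`, which is the usual trade-off of ridge regression.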

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Risk and Portfolio Optimization