From Relative Entropy to Minimax: A Unified Framework for Coverage in MDPs

Xihe Gu; Urbashi Mitra; Tara Javidi

arXiv:2601.11890 · cs.LG · January 21, 2026

Open Access

TL;DR

This paper introduces a unified framework for exploration in reward-free MDPs using a family of concave coverage objectives, enabling explicit control over exploration priorities and recovering worst-case strategies as a parameter grows.

Contribution

It proposes a novel family of coverage objectives over occupancy measures that unifies existing approaches and provides a gradient-based method for targeted exploration in MDPs.

Findings

01. The framework unifies divergence-based marginal matching, weighted average coverage, and worst-case (minimax) coverage objectives.

02. The gradient-based algorithm directs exploration toward under-explored state-action pairs.

03. Increasing the parameter $ρ$ emphasizes worst-case coverage, recovering minimax strategies.

Abstract

Targeted and deliberate exploration of state-action pairs is essential in reward-free Markov Decision Problems (MDPs). More precisely, different state-action pairs exhibit different degrees of importance or difficulty, which must be actively and explicitly built into a controlled exploration strategy. To this end, we propose a weighted and parameterized family of concave coverage objectives, denoted by $U_{ρ}$, defined directly over state-action occupancy measures. This family unifies several widely studied objectives within a single framework, including divergence-based marginal matching, weighted average coverage, and worst-case (minimax) coverage. While the concavity of $U_{ρ}$ captures the diminishing returns associated with over-exploration, the simple closed form of the gradient of $U_{ρ}$ enables explicit control to prioritize under-explored state-action pairs. Leveraging…
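The abstract does not state the formula for $U_{ρ}$, so the following is only a sketch under an assumed form, not the paper's definition. It uses the power-type objective $-(1/ρ) \sum_i w_i ((w_i/d_i)^{ρ} - 1)$ over an occupancy vector $d$ with target weights $w$. This stand-in was chosen because it is concave in $d$, tends to the negative relative entropy $-\mathrm{KL}(w \| d)$ as $ρ \to 0$, and has a gradient that concentrates on the worst-covered pair as $ρ$ grows. The function names `coverage_objective`, `coverage_gradient`, and `gradient_shares` are hypothetical.

```python
import math


def coverage_objective(d, w, rho):
    """Assumed illustrative objective (not the paper's definition):
    U_rho(d) = -(1/rho) * sum_i w_i * ((w_i / d_i)**rho - 1),
    with the rho -> 0 limit -KL(w || d) handled explicitly."""
    if rho == 0:
        return -sum(wi * math.log(wi / di) for wi, di in zip(w, d))
    return -sum(wi * ((wi / di) ** rho - 1.0) for wi, di in zip(w, d)) / rho


def coverage_gradient(d, w, rho):
    """Closed-form gradient of the assumed objective:
    dU/dd_i = (w_i / d_i)**(rho + 1), largest where d_i is small relative to w_i."""
    return [(wi / di) ** (rho + 1.0) for wi, di in zip(w, d)]


def gradient_shares(d, w, rho):
    """Normalize the gradient to show how exploration priority is split across pairs."""
    g = coverage_gradient(d, w, rho)
    total = sum(g)
    return [gi / total for gi in g]


if __name__ == "__main__":
    # Three state-action pairs with equal target weight but uneven current coverage.
    w = [1 / 3, 1 / 3, 1 / 3]
    d = [0.6, 0.3, 0.1]
    for rho in (0.0, 1.0, 10.0):
        shares = gradient_shares(d, w, rho)
        print(rho, [round(s, 4) for s in shares])
```

In this toy setting, the least-visited pair always receives the largest gradient share, and raising `rho` moves nearly all of the priority onto it, which mirrors the worst-case behavior the abstract describes for large parameter values.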

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Bayesian Modeling and Causal Inference