Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
Pierre Boudart (SIERRA), Pierre Gaillard (Thoth), Alessandro Rudi (PSL, DI-ENS, Inria)

TL;DR
This paper introduces a new regret bound for reinforcement learning in multinomial logistic MDPs that adapts to problem structure, achieving minimax optimality and improving on previous bounds for structured MDPs.
Contribution
The authors propose a variance-aware algorithm with regret bounds that adapt to the normalized variance of the value function, improving upon existing bounds and establishing minimax optimality.
Findings
The new regret bound depends on a problem-dependent variance measure.
For KL-constrained robust MDPs, the bound reduces horizon dependence by a factor of H.
The paper proves a matching lower bound, establishing minimax optimality.
Abstract
We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret bound (Li et al., 2024) stated in terms of the feature dimension, the episode length $H$, and the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar\sigma_T$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret bound that depends on $\bar\sigma_T$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar\sigma_T =…
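To make the two central objects of the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the usual MNL parameterisation in which the probability of each next state is a softmax over inner products between a feature vector and a parameter vector, and it computes the plain variance of a value function under that distribution. The names `mnl_probs`, `value_variance`, `features`, `theta`, and `values` are hypothetical, and the exact normalisation used in the paper's variance constant is not given in this excerpt, so none is applied here.

```python
import math


def mnl_probs(features, theta):
    """Next-state probabilities under an assumed MNL (softmax) model.

    features: one feature vector per candidate next state (hypothetical input).
    theta: parameter vector of the same dimension.
    """
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    shift = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def value_variance(probs, values):
    """Variance of the next-state value under the given distribution."""
    mean = sum(p * v for p, v in zip(probs, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(probs, values))


# Toy example: three next states, two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
theta = [0.0, 0.0]          # all logits equal, so the distribution is uniform
values = [0.0, 1.0, 2.0]    # illustrative downstream values

probs = mnl_probs(features, theta)
var = value_variance(probs, values)
flat_var = value_variance(probs, [1.0, 1.0, 1.0])  # identical values
```

When every next state has the same downstream value, the variance is zero regardless of the transition probabilities. This is the kind of structure a variance-dependent quantity can pick up on, whereas a worst-case bound cannot.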
