Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation
Jaehyun Park, Junyeop Kwon, Dabeen Lee

TL;DR
This paper introduces a new efficient algorithm for infinite-horizon reinforcement learning with multinomial logistic function approximation, providing regret upper bounds and complementary lower bounds for both average and discounted reward settings.
Contribution
The paper develops a provably efficient value iteration-based algorithm for MNL-based RL and establishes regret upper bounds together with complementary lower bounds, advancing understanding of non-linear function approximation in RL.
Findings
Achieves a regret bound of $\tilde{O}(dD\sqrt{T})$ for average reward
Achieves a regret bound of $\tilde{O}(d(1-\gamma)^{-2}\sqrt{T})$ for discounted reward
Provides several regret lower bounds that complement the upper bounds, including for finite-horizon episodic MDPs
Abstract
We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{O}(dD\sqrt{T})$, where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{O}(d(1-\gamma)^{-2}\sqrt{T})$ regret, where $\gamma$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of for learning…
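To make the setting concrete, here is a minimal sketch assuming the usual softmax form of an MNL transition model, where the probability of each next state is proportional to the exponential of an inner product between a feature vector and a parameter vector. The names `mnl_transition_probs`, `discounted_value_iteration`, `phi`, and `theta`, and the toy numbers, are hypothetical and not taken from the paper. The value iteration below runs on a known parameter vector, so it illustrates only the planning step, not the paper's learning algorithm, which has to estimate the parameter from observed transitions.

```python
import math


def mnl_transition_probs(features, theta):
    """Return next-state probabilities under an assumed softmax (MNL) model.

    features[j] is the feature vector for the j-th candidate next state
    (for one fixed state-action pair); theta is the parameter vector.
    """
    logits = [sum(f * t for f, t in zip(phi_j, theta)) for phi_j in features]
    shift = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def discounted_value_iteration(P, R, gamma, sweeps=500):
    """Plain discounted value iteration on a known tabular MDP.

    P[s][a] is a list of next-state probabilities, R[s][a] is a reward
    in [0, 1], and gamma is the discount factor in [0, 1).
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [
            max(
                R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
    return V


# Toy instance (all values hypothetical): 2 states, 2 actions, d = 2.
theta = [1.0, -0.5]
phi = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]],  # state 0
    [[[0.5, 0.5], [0.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]],  # state 1
]
R = [[0.0, 0.1], [1.0, 0.5]]
gamma = 0.9

# Build the transition table from the MNL model, then plan on it.
P = [[mnl_transition_probs(phi[s][a], theta) for a in range(2)] for s in range(2)]
V = discounted_value_iteration(P, R, gamma)
print(P)
print(V)
```

Since rewards lie in [0, 1], every value is at most $1/(1-\gamma)$, which is where the dependence on $(1-\gamma)^{-1}$ in discounted-reward regret bounds originates.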
Taxonomy
Topics: Elevator Systems and Control · Scheduling and Optimization Algorithms
