Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
Lin F. Yang, Mengdi Wang

TL;DR
This paper introduces a sample-efficient parametric Q-learning algorithm for large-scale MDPs with linear features, achieving near-optimal sample complexity by exploiting the monotonicity and intrinsic noise structure of the Bellman operator.
Contribution
It proposes a novel parametric Q-learning method with provable sample optimality that scales with feature dimension, independent of state space size, and incorporates variance reduction techniques.
Findings
Achieves $\tilde{O}\left(\frac{K}{\epsilon^2(1-\gamma)^3}\right)$ sample complexity
Proves a matching information-theoretical lower bound
Demonstrates effectiveness in large-scale MDPs with linear features
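To get a feel for how the stated bound scales, the leading term can be evaluated directly. The helper below is a hypothetical illustration (its name and signature are not from the paper). It drops the constants and logarithmic factors hidden by the $\tilde{O}$ notation, so the result is a relative scale and not an actual sample count.

```python
def sample_complexity_scale(K: int, epsilon: float, gamma: float) -> float:
    """Leading term K / (epsilon^2 * (1 - gamma)^3) of the stated bound.

    Assumption: constants and log factors are ignored, so only ratios
    between two calls are meaningful.
    """
    if K <= 0:
        raise ValueError("feature dimension K must be positive")
    if epsilon <= 0.0:
        raise ValueError("accuracy epsilon must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("discount factor gamma must lie in (0, 1)")
    return K / (epsilon ** 2 * (1.0 - gamma) ** 3)
```

For instance, halving `epsilon` multiplies the result by four, and moving `gamma` from 0.9 to 0.95 multiplies it by eight, while the number of states never enters the expression.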
Abstract
Consider a Markov decision process (MDP) that admits a set of state-action features, which can linearly express the process's probabilistic transition model. We propose a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension and invariant with respect to the size of the state space. To further improve its sample efficiency, we exploit the monotonicity property and intrinsic noise structure of the Bellman operator, provided the existence of anchor state-actions that imply implicit non-negativity in the feature space. We augment the algorithm using techniques of variance reduction, monotonicity preservation, and confidence bounds. It is proved to find a policy which is $\epsilon$-optimal from any initial state with high probability using $\tilde{O}(K/\epsilon^2(1-\gamma)^3)$ sample transitions for…
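The basic idea in the abstract can be sketched on a toy problem. The code below is a minimal illustration under several assumptions that are not taken from the page: transitions factor as P(s'|s,a) = Σ_k φ_k(s,a) ψ_k(s'), the Q-function is parametrized as r(s,a) + γ φ(s,a)ᵀw, the first K state-action pairs act as anchors whose feature vectors are unit vectors, and a generative model can be queried at those anchors. All function names are hypothetical, and the variance reduction, monotonicity preservation, and confidence bounds mentioned above are omitted.

```python
import random

def normalize(xs):
    total = sum(xs)
    return [x / total for x in xs]

def make_linear_mdp(S, A, K, rng):
    """Toy MDP with P(s'|s,a) = sum_k phi[s*A+a][k] * psi[k][s'] (assumed form)."""
    assert S * A >= K, "need at least K state-action pairs to place anchors"
    # Feature vectors are probability vectors over K components (assumption).
    phi = [normalize([rng.random() + 1e-3 for _ in range(K)]) for _ in range(S * A)]
    # Anchors: the first K flattened (s, a) pairs get unit-vector features.
    for k in range(K):
        phi[k] = [1.0 if j == k else 0.0 for j in range(K)]
    # psi[k] is a distribution over next states.
    psi = [normalize([rng.random() + 1e-3 for _ in range(S)]) for _ in range(K)]
    r = [rng.random() for _ in range(S * A)]  # rewards in [0, 1)
    return phi, psi, r

def state_values(phi, r, w, gamma, S, A):
    """V_w(s) = max_a [ r(s,a) + gamma * phi(s,a)^T w ]."""
    values = []
    for s in range(S):
        q = [r[s * A + a] + gamma * sum(p * wk for p, wk in zip(phi[s * A + a], w))
             for a in range(A)]
        values.append(max(q))
    return values

def parametric_q_learning(phi, psi, r, gamma, S, A, m, T, rng):
    """Sample-based update of the K-dimensional parameter w."""
    K = len(psi)
    w = [0.0] * K
    for _ in range(T):
        V = state_values(phi, r, w, gamma, S, A)
        new_w = []
        for k in range(K):
            # Generative-model query at anchor k: next states follow psi[k].
            samples = rng.choices(range(S), weights=psi[k], k=m)
            new_w.append(sum(V[s2] for s2 in samples) / m)
        w = new_w
    return w

def exact_parameters(phi, psi, r, gamma, S, A, iters=200):
    """Same iteration with exact expectations, used only as a reference."""
    K = len(psi)
    w = [0.0] * K
    for _ in range(iters):
        V = state_values(phi, r, w, gamma, S, A)
        w = [sum(p * v for p, v in zip(psi[k], V)) for k in range(K)]
    return w
```

Each iteration draws `m` samples at each of the `K` anchors, so the total sample count is `K * m * T` and does not depend on `S`. Calling `parametric_q_learning` and `exact_parameters` on the same toy MDP gives a quick check that the sampled parameters approach the exact ones as `m` grows.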
Taxonomy
Topics: Reinforcement Learning in Robotics · Fault Detection and Control Systems · Control Systems and Identification
Methods: Q-Learning
