Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound
Lin F. Yang, Mengdi Wang

TL;DR
This paper introduces MatrixRL, an online reinforcement learning algorithm that learns low-dimensional representations of transition models using features or kernels, achieving near-optimal regret bounds in high-dimensional settings.
Contribution
The paper presents the first near-optimal regret bounds for RL with feature and kernel representations, extending theoretical guarantees to high-dimensional and kernelized models.
Findings
MatrixRL achieves a regret bound of O(H^2 d log T √T) with features, where d is the number of features.
Kernelized MatrixRL achieves a regret bound of O(H^2 d̃ log T √T) with kernels.
These are the first regret bounds for feature- and kernel-based RL that are near-optimal in T and in the dimension.
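As a rough illustration of what this bound shape implies, the sketch below evaluates H^2 d log T √T with every hidden constant set to 1. That choice, and the function names, are assumptions made only for this example, so the numbers show how the bound scales and are not actual regret values. The point is that the bound divided by T shrinks as T grows, which is what sublinear regret means.

```python
import math


def regret_bound_shape(H: int, d: int, T: int) -> float:
    """Shape of the bound H^2 * d * log(T) * sqrt(T), constants dropped."""
    return (H ** 2) * d * math.log(T) * math.sqrt(T)


def average_regret_bound_shape(H: int, d: int, T: int) -> float:
    """Bound shape divided by T, i.e. the per-step average."""
    return regret_bound_shape(H, d, T) / T


if __name__ == "__main__":
    H, d = 10, 20  # hypothetical horizon and feature count
    for T in (10**2, 10**4, 10**6):
        print(T, average_regret_bound_shape(H, d, T))
```

The bound is linear in d and quadratic in H, so doubling the number of features doubles it, while doubling the horizon quadruples it.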
Abstract
Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However, existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon H. In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandit to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound of O(H^2 d log T √T), where d is the number of features. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized…
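The idea of learning a low-dimensional representation of the transition model can be sketched as a regression problem. The example below is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes the transition probabilities factor as P(s'|s,a) = φ(s,a)^T M ψ(s') for an unknown d × d' matrix M, and it estimates M with a ridge-regression step in the style of linear bandits. The function names, the one-hot choice of ψ, and the synthetic data are all hypothetical, and the exploration bonus that an optimistic algorithm would add on top is omitted.

```python
import numpy as np


def estimate_core_matrix(phis, psis_next, K_psi, reg=1.0):
    """Ridge estimate of M from transitions (assumed bilinear model).

    phis:      (n, d)   rows are phi(s_i, a_i)
    psis_next: (n, d2)  rows are psi(s'_i) for the observed next states
    K_psi:     (d2, d2) sum over all states s' of psi(s') psi(s')^T
    """
    d = phis.shape[1]
    A = reg * np.eye(d) + phis.T @ phis          # regularized Gram matrix
    targets = psis_next @ np.linalg.inv(K_psi)   # E[target | phi] = phi^T M
    return np.linalg.solve(A, phis.T @ targets)  # shape (d, d2)


def demo(n=20000, d=3, num_states=5, seed=0):
    """Synthetic check: draw transitions from a known M and re-estimate it."""
    rng = np.random.default_rng(seed)
    # Rows of M_true are distributions over next states (illustrative choice).
    M_true = rng.dirichlet(np.ones(num_states), size=d)
    psi = np.eye(num_states)                     # one-hot psi(s'), so K_psi = I
    K_psi = psi.T @ psi
    # Each phi(s, a) is a mixture weight vector, so phi^T M is a distribution.
    phis = rng.dirichlet(np.ones(d), size=n)
    probs = phis @ M_true
    # Sample one next state per row by inverting the cumulative distribution.
    cdf = probs.cumsum(axis=1)
    u = rng.random(n)[:, None]
    next_states = np.minimum((u > cdf).sum(axis=1), num_states - 1)
    M_hat = estimate_core_matrix(phis, psi[next_states], K_psi)
    return M_true, M_hat


if __name__ == "__main__":
    M_true, M_hat = demo()
    print("max abs error:", np.abs(M_hat - M_true).max())
```

The estimator only ever touches d × d and d' × d' matrices, so under this assumed factorization its cost depends on the feature dimensions and not on the number of states or actions.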
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
