Learning with Good Feature Representations in Bandits and in RL with a Generative Model
Tor Lattimore, Csaba Szepesvari, Gellert Weisz

TL;DR
This paper shows that when linear features approximate rewards or action-values with a small uniform error ε, a learner can find near-optimal actions in bandits and RL from few samples by choosing which actions to query with the Kiefer-Wolfowitz theorem.
Contribution
It provides theoretical bounds showing how feature approximation error affects learning efficiency in bandits and RL. The bounds depend on the features only through their dimension d and approximation error ε, not on any further details of the feature map.
Findings
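A positive result that uses the Kiefer-Wolfowitz theorem to choose which actions to examine
Regret bound of order √(dn log(k)) + ε n √d log(n) in linear bandits, with k the number of actions and n the horizon
Approximate policy iteration with a generative model learns a near-optimal policy, with an additive error that scales with ε√d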
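The action-selection idea in the findings above rests on a G-optimal design: a distribution ρ over actions that minimises the worst-case value of aᵀ G(ρ)⁻¹ a, where G(ρ) = Σ ρ(a) a aᵀ. The Kiefer-Wolfowitz theorem says the optimal worst-case value is exactly d. The following is a minimal sketch of one standard way to approximate such a design, using Frank-Wolfe iterations; it is not taken from the paper, and the function name `g_optimal_design` and its parameters are hypothetical.

```python
import numpy as np

def g_optimal_design(features, iters=5000, tol=0.01):
    """Approximate G-optimal design via Frank-Wolfe (illustrative sketch).

    features: (k, d) array, one row per action; rows are assumed to span R^d.
    Returns a probability vector rho over the k actions.
    """
    k, d = features.shape
    rho = np.full(k, 1.0 / k)  # start from the uniform design
    for _ in range(iters):
        # Design matrix G(rho) = sum_a rho(a) a a^T
        G = features.T @ (features * rho[:, None])
        # g[a] = a^T G^{-1} a for every action a
        g = np.einsum("ij,ij->i", features @ np.linalg.inv(G), features)
        j = int(np.argmax(g))
        # Kiefer-Wolfowitz: the optimum has max_a g[a] = d, so stop when close
        if g[j] <= d * (1.0 + tol):
            break
        # Closed-form line-search step for the log-det objective
        step = (g[j] / d - 1.0) / (g[j] - 1.0)
        rho = (1.0 - step) * rho
        rho[j] += step
    return rho
```

Calling `g_optimal_design(A)` on a `(k, d)` feature matrix `A` returns design weights; the actions with non-negligible weight are the ones a learner would query. The stopping tolerance `tol` is an assumption of this sketch and trades iterations for how close the worst-case value gets to d.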
Abstract
The construction by Du et al. (2019) implies that even if a learner is given linear features in ℝ^d that approximate the rewards in a bandit with a uniform error of ε, then searching for an action that is optimal up to O(ε) requires examining essentially all actions. We use the Kiefer-Wolfowitz theorem to prove a positive result that by checking only a few actions, a learner can always find an action that is suboptimal with an error of at most O(ε√d). Thus, features are useful when the approximation error is small relative to the dimensionality of the features. The idea is applied to stochastic bandits and reinforcement learning with a generative model where the learner has access to d-dimensional linear features that approximate the action-value functions for all policies to an accuracy of ε. For linear bandits, we prove a bound on…
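To see where a √d factor can come from, here is a short derivation sketch for the noiseless case. It is a simplified illustration rather than the paper's exact statement, and the notation (action set 𝒜, misspecification Δ_a, design ρ, estimate θ̂) is assumed here for the purpose of the sketch.

```latex
% Assumptions (notation introduced for this sketch):
%   finite action set \mathcal{A} \subset \mathbb{R}^d spanning \mathbb{R}^d,
%   rewards \mu_a = \langle a, \theta \rangle + \Delta_a with |\Delta_a| \le \varepsilon,
%   rewards observed without noise.
% Kiefer-Wolfowitz: there is a design \rho on \mathcal{A}, with support of size
% at most d(d+1)/2, such that
\[
  G(\rho) = \sum_{b \in \mathcal{A}} \rho(b)\, b b^\top ,
  \qquad
  \max_{a \in \mathcal{A}} \|a\|_{G(\rho)^{-1}}^2 = d .
\]
% Weighted least squares on the support of \rho:
\[
  \hat\theta = G(\rho)^{-1} \sum_{b} \rho(b)\, b\, \mu_b
             = \theta + G(\rho)^{-1} \sum_{b} \rho(b)\, b\, \Delta_b .
\]
% Error of the estimate at any action a (Cauchy-Schwarz with respect to \rho):
\[
  \bigl| \langle a, \hat\theta - \theta \rangle \bigr|
  \le \varepsilon \sum_{b} \rho(b)\, \bigl| a^\top G(\rho)^{-1} b \bigr|
  \le \varepsilon \sqrt{\sum_{b} \rho(b)\, \bigl( a^\top G(\rho)^{-1} b \bigr)^2}
  = \varepsilon\, \|a\|_{G(\rho)^{-1}}
  \le \varepsilon \sqrt{d} .
\]
% Hence |\langle a, \hat\theta \rangle - \mu_a| \le \varepsilon (1 + \sqrt{d}) for all a,
% and the greedy action \hat a = \arg\max_a \langle a, \hat\theta \rangle satisfies
\[
  \max_{a \in \mathcal{A}} \mu_a - \mu_{\hat a} \le 2 \varepsilon \bigl( 1 + \sqrt{d} \bigr) .
\]
```

Only the actions in the support of ρ are queried, which is how a small number of checks can still give an O(ε√d) guarantee over the whole action set.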
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Reinforcement Learning in Robotics
