Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks
Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao

TL;DR
This paper demonstrates that deep neural networks can efficiently evaluate policies in reinforcement learning by exploiting low-dimensional manifold structures, leading to sample-efficient estimators with theoretical guarantees.
Contribution
It establishes a sharp error bound for off-policy evaluation with deep networks that exploits intrinsic low-dimensional structure, supported by a novel CNN approximation result.
Findings
Error bound depends on intrinsic dimension and policy mismatch
Sample efficiency achieved by exploiting manifold structure
CNN approximation results support theoretical analysis
Abstract
We consider the off-policy evaluation problem of reinforcement learning using deep convolutional neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Specifically, we establish a sharp error bound for fitted Q-evaluation, which depends on the intrinsic dimension of the state-action space, the smoothness of the Bellman operator, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' mismatch in the function…
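The fitted Q-evaluation procedure named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper fits a deep convolutional network at each step, whereas the sketch below swaps in a tabular function class, for which least-squares regression reduces to averaging the regression targets per state-action pair. The names `fitted_q_evaluation`, `policy_value`, and the toy one-state MDP are hypothetical and chosen only for illustration.

```python
def fitted_q_evaluation(transitions, target_policy, n_actions, horizon, gamma=1.0):
    """Tabular stand-in for fitted Q-evaluation (assumption: not a deep network).

    transitions: list of (s, a, r, s_next, done) collected by a behavior policy.
    target_policy: function mapping a state to a list of action probabilities.
    Returns a dict mapping (state, action) to the estimated Q-value.
    """
    q = {}  # Q is initialised to zero everywhere (missing keys read as 0.0)
    for _ in range(horizon):
        sums, counts = {}, {}
        for s, a, r, s_next, done in transitions:
            if done:
                target = r
            else:
                probs = target_policy(s_next)
                # Bellman backup under the TARGET policy, using the previous Q
                next_v = sum(probs[b] * q.get((s_next, b), 0.0) for b in range(n_actions))
                target = r + gamma * next_v
            sums[(s, a)] = sums.get((s, a), 0.0) + target
            counts[(s, a)] = counts.get((s, a), 0) + 1
        # Least-squares fit over a tabular class = per-pair mean of the targets
        q = {key: sums[key] / counts[key] for key in sums}
    return q


def policy_value(q, initial_states, target_policy, n_actions):
    """Average the fitted Q over initial states and target-policy actions."""
    total = 0.0
    for s in initial_states:
        probs = target_policy(s)
        total += sum(probs[a] * q.get((s, a), 0.0) for a in range(n_actions))
    return total / len(initial_states)


# Toy data (hypothetical): one state, action 1 pays reward 1, action 0 pays 0.
data = [(0, 0, 0.0, 0, False), (0, 1, 1.0, 0, False)]
always_one = lambda s: [0.0, 1.0]  # target policy: always take action 1
q_hat = fitted_q_evaluation(data, always_one, n_actions=2, horizon=3)
print(policy_value(q_hat, [0], always_one, n_actions=2))  # prints 3.0
```

The behavior data covers both actions, while the target policy only ever takes action 1. The estimate is still exact here because every state-action pair the target policy visits appears in the data, which is the kind of coverage that the divergence term in the error bound quantifies.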
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Age of Information Optimization
