Learning Optimal and Sample-Efficient Decision Policies with Guarantees
Daqian Shao

TL;DR
This paper introduces a sample-efficient method, with guarantees, for learning decision policies from offline data that contains hidden confounders. It targets high-stakes domains such as healthcare and finance.
Contribution
It develops a novel algorithm based on instrumental variables and conditional moment restrictions (CMR) that addresses hidden confounders, improves sample efficiency, and comes with convergence guarantees.
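To make the role of instrumental variables concrete, the sketch below shows why a hidden confounder biases a naive regression and how an instrument corrects it. It is not the paper's algorithm: it uses classical two-stage least squares on a synthetic linear model, and every name and constant in it (`simulate`, `ols_slope`, `two_stage_least_squares`, the true effect of 2.0) is an assumption made for illustration. The link to CMR is that an instrument Z yields a restriction of the form E[Y - f(A) | Z] = 0 on the unknown effect function f.

```python
import numpy as np


def simulate(n, seed=0):
    """Generate synthetic (hypothetical) data with a hidden confounder."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                   # hidden confounder, never observed
    z = rng.normal(size=n)                   # instrument: affects the action only
    a = z + u + 0.1 * rng.normal(size=n)     # action depends on z and on u
    y = 2.0 * a + 3.0 * u + 0.1 * rng.normal(size=n)  # true causal effect is 2.0
    return z, a, y


def ols_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float(x_c @ y_c / (x_c @ x_c))


def two_stage_least_squares(z, a, y):
    """Classical 2SLS with one instrument and one action variable."""
    # Stage 1: predict the action from the instrument alone.
    a_hat = a.mean() + ols_slope(z, a) * (z - z.mean())
    # Stage 2: regress the outcome on the predicted action.
    return ols_slope(a_hat, y)


z, a, y = simulate(20000)
naive = ols_slope(a, y)                  # ignores the confounder
iv = two_stage_least_squares(z, a, y)    # uses the instrument

print(f"naive OLS estimate: {naive:.2f}")
print(f"2SLS estimate:      {iv:.2f}")
```

In this setup the naive estimate lands well above the true effect of 2.0, because the confounder drives both the action and the outcome, while the 2SLS estimate stays close to 2.0. Nonlinear effect functions need more general CMR solvers, which is the setting the paper addresses.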
Findings
Outperforms state-of-the-art algorithms in sample efficiency
Successfully learns effective policies from offline datasets with confounders
Demonstrates applicability to real-world decision-making benchmarks
Abstract
The paradigm of decision-making has been revolutionised by reinforcement learning and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental…
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research
