Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures
Marcus Hutter

TL;DR
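This paper shows that, in unknown environments, the Bayes-optimal policy based on a mixture over a class of candidate environments is self-optimizing (its average value converges to the optimal value) whenever the class admits self-optimizing policies at all, and Pareto-optimal (no other policy does at least as well in every environment of the class and strictly better in at least one).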
Contribution
It establishes that the Bayes-optimal policy derived from a mixture distribution is Pareto-optimal in general probabilistic environments, and self-optimizing under the sole structural assumption that the environment class admits self-optimizing policies.
Findings
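The average value of the mixture-based Bayes-optimal policy converges, for every environment in the class, to the optimal value achieved by a policy that knows the true environment in advance, provided the class admits self-optimizing policies.
The necessary condition that the environment class admits self-optimizing policies at all is also sufficient for the mixture-based policy to be self-optimizing.
The Bayes-optimal policy is Pareto-optimal: no other policy achieves higher or equal value in all environments of the class and strictly higher value in at least one.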
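The Pareto-optimality notion in the last finding can be illustrated with a small check. This is a minimal sketch, not code from the paper: the policy names, the two environments, and the value numbers are all hypothetical, chosen only to show what "dominated" means.

```python
from typing import Dict, Sequence


def dominates(v_a: Sequence[float], v_b: Sequence[float]) -> bool:
    """True if A is at least as good as B in every environment
    and strictly better in at least one."""
    at_least_as_good = all(a >= b for a, b in zip(v_a, v_b))
    strictly_better = any(a > b for a, b in zip(v_a, v_b))
    return at_least_as_good and strictly_better


def is_pareto_optimal(name: str, values: Dict[str, Sequence[float]]) -> bool:
    """A policy is Pareto-optimal if no other policy dominates it."""
    return not any(
        dominates(v, values[name])
        for other, v in values.items()
        if other != name
    )


# Hypothetical values of three policies in two environments (made-up numbers).
values = {
    "p1": (0.9, 0.2),
    "p2": (0.5, 0.5),
    "p3": (0.4, 0.5),  # dominated by p2: worse in env 0, equal in env 1
}

for policy in values:
    print(policy, is_pareto_optimal(policy, values))
```

Note that Pareto-optimality does not mean a policy beats every other policy everywhere: here p1 and p2 are both Pareto-optimal, and each is better than the other in one environment.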
Abstract
The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle $t$ action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities in general may depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. This very general setting includes, but is not limited to, (partially observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if $\mu$ is known. Reinforcement learning is usually used if $\mu$ is unknown. In the Bayesian approach one defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu \in \mathcal{M}$, where $\mathcal{M}$ is any class of distributions including the true environment $\mu$. We show that the Bayes-optimal policy based…
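The mixture described in the abstract, a weighted sum of candidate environment distributions, can be sketched for a deliberately tiny case. The sketch below assumes a finite class of two memoryless environments with binary rewards, which is far more restrictive than the history-dependent setting of the paper. The names `likelihood`, `mixture_prob`, and `posterior`, and all numbers, are illustrative assumptions. It shows only the mixture and its weight update, not the computation of the Bayes-optimal policy.

```python
from typing import List, Sequence, Tuple

# Assumption: an environment is a tuple giving, for each action,
# the probability of receiving reward 1 (rewards are 0 or 1).
Env = Tuple[float, ...]


def likelihood(env: Env, action: int, reward: int) -> float:
    """nu(reward | action) for one candidate environment nu."""
    p = env[action]
    return p if reward == 1 else 1.0 - p


def mixture_prob(weights: Sequence[float], envs: Sequence[Env],
                 action: int, reward: int) -> float:
    """xi(reward | action) = sum over nu of w_nu * nu(reward | action)."""
    return sum(w * likelihood(e, action, reward) for w, e in zip(weights, envs))


def posterior(weights: Sequence[float], envs: Sequence[Env],
              action: int, reward: int) -> List[float]:
    """Bayes update of the weights after observing (action, reward)."""
    z = mixture_prob(weights, envs, action, reward)
    return [w * likelihood(e, action, reward) / z for w, e in zip(weights, envs)]


# Two hypothetical environments with a uniform prior over them.
envs = [(0.9, 0.1), (0.1, 0.9)]
weights = [0.5, 0.5]

# Observing reward 1 after action 0 shifts weight toward the first environment.
weights = posterior(weights, envs, action=0, reward=1)
print(weights)
print(mixture_prob(weights, envs, action=0, reward=1))
```

As observations accumulate, the weights of environments that explain the data poorly shrink, so predictions of the mixture move toward those of the environments that remain plausible.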
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
