Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy
Cameron Allen, Aaron Kirtland, Ruo Yu Tao, Sam Lobel, Daniel Scott, Nicholas Petrocelli, Omer Gottesman, Ronald Parr, Michael L. Littman, George Konidaris

TL;DR
This paper introduces the λ-discrepancy, a metric that signals when an agent's state representation is non-Markovian in a partially observable environment, and shows that minimizing it helps reinforcement learning agents learn better memory functions.
Contribution
The paper proposes the λ-discrepancy metric to identify partial observability and demonstrates how minimizing it enhances learning in partially observable environments.
Findings
λ-discrepancy is zero in Markov decision processes
Minimizing λ-discrepancy improves learning in POMDPs
Proposed method outperforms single-value network baselines
Abstract
Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to -- or knowledge of -- an underlying, unobservable state space. Our metric, the λ-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD(λ) with a different value of λ. Since TD(λ=0) makes an implicit Markov assumption and TD(λ=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that…
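To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed policy has already been folded into a small Markov reward process, that each hidden state maps deterministically to one observation, and that hidden states sharing an observation are weighted by their discounted occupancy from the start distribution. The function name `lambda_discrepancy` and all of its parameters are hypothetical. It compares the TD(λ=1) (Monte Carlo) value of each observation with the TD(λ=0) fixed point and returns the per-observation difference, whereas the paper works with a norm over such differences.

```python
import numpy as np

def lambda_discrepancy(P, r, phi, p0, gamma):
    """Per-observation gap between TD(1) and TD(0) value estimates.

    P     : (S, S) hidden-state transition matrix under a fixed policy
    r     : (S,)   expected one-step reward in each hidden state
    phi   : (S,)   integer observation index emitted by each hidden state
    p0    : (S,)   start-state distribution
    gamma : discount factor in [0, 1)
    Assumes every observation has nonzero discounted occupancy.
    """
    n_states = P.shape[0]
    n_obs = int(phi.max()) + 1
    A = np.eye(n_states) - gamma * P

    # True hidden-state values and discounted state occupancy.
    v_state = np.linalg.solve(A, r)
    mu = np.linalg.solve(A.T, p0)

    # Phi[s, o] = 1 if state s emits observation o.
    Phi = np.eye(n_obs)[phi]
    # W[o, s] = weight of hidden state s given observation o.
    W = (Phi * mu[:, None]).T
    W = W / W.sum(axis=1, keepdims=True)

    # TD(1): average the true returns of the states behind each observation.
    v_mc = W @ v_state

    # TD(0): solve the Bellman equation on the observation-level model,
    # which treats observations as if they were Markov states.
    P_obs = W @ P @ Phi
    r_obs = W @ r
    v_td = np.linalg.solve(np.eye(n_obs) - gamma * P_obs, r_obs)

    return v_mc - v_td

# Three-state chain s0 -> s1 -> s2 (absorbing); reward 1 on leaving s1.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 1.0, 0.0])
p0 = np.array([1.0, 0.0, 0.0])

# Aliased: s0 and s1 emit the same observation (non-Markovian).
d_aliased = lambda_discrepancy(P, r, np.array([0, 0, 1]), p0, gamma=0.5)
# Fully observable: every state has its own observation (Markovian).
d_markov = lambda_discrepancy(P, r, np.array([0, 1, 2]), p0, gamma=0.5)

print(d_aliased)  # approximately [0.1667, 0.]
print(d_markov)   # approximately [0., 0., 0.]
```

In this toy chain the aliased representation yields a nonzero gap at the shared observation, while the fully observable one yields zero everywhere, consistent with the finding that the λ-discrepancy is zero in Markov decision processes.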
Taxonomy
Topics: Simulation Techniques and Applications
