Guided Policy Optimization under Partial Observability
Yueheng Li, Guangming Xie, Zongqing Lu

TL;DR
This paper introduces Guided Policy Optimization (GPO), a novel reinforcement learning framework for partially observable environments that leverages privileged information to improve policy learning, achieving near-optimal performance and outperforming existing methods.
Contribution
The paper proposes GPO, a new framework that co-trains a guider and a learner, effectively utilizing privileged information to enhance policy learning under partial observability.
Findings
In theory, GPO achieves optimality comparable to direct RL.
GPO significantly outperforms existing methods empirically.
GPO is effective in continuous control, noisy, and memory-based tasks.
Abstract
Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. While additional information, such as that available in simulations, can enhance training, effectively leveraging it remains an open problem. To address this, we introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. The guider takes advantage of privileged information while ensuring alignment with the learner's policy that is primarily trained via imitation learning. We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in existing approaches. Empirical evaluations show strong performance of GPO across various tasks, including continuous control with partial observability and noise, and memory-based challenges,…
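The co-training idea can be illustrated with a small numerical sketch. This is not the paper's algorithm (which, according to the reviews, is built on PPO). It is a toy reading of the abstract under stated assumptions: a hypothetical one-step problem with two hidden states, a learner that sees no informative observation, exact exponentiated-gradient updates for the guider, and a "backtracking" step that simply resets the guider to the learner's policy. The reward table and the names `improve`, `imitate`, and `train` are made up for illustration.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): two equally likely hidden
# states, no informative observation, three actions (left, right, safe).
# Rows index the hidden state, columns index the action.
REWARD = np.array([[1.0, -1.0, 0.5],
                   [-1.0, 1.0, 0.5]])
STATE_PROB = np.array([0.5, 0.5])


def improve(guider, step=0.5):
    """One exponentiated-gradient improvement step using the privileged state."""
    new = guider * np.exp(step * REWARD)
    return new / new.sum(axis=1, keepdims=True)


def imitate(guider):
    """State-blind imitation: the cross-entropy-optimal learner is the
    state-averaged action distribution of the guider."""
    return STATE_PROB @ guider


def expected_return(learner):
    """Expected one-step reward of a state-blind action distribution."""
    return float(STATE_PROB @ (REWARD @ learner))


def train(iterations=200, backtrack=True):
    guider = np.full((2, 3), 1.0 / 3.0)  # uniform policy in each hidden state
    learner = imitate(guider)
    for _ in range(iterations):
        guider = improve(guider)   # guider exploits privileged information
        learner = imitate(guider)  # learner copies what it can represent
        if backtrack:
            # Assumed simplification: pull the guider all the way back
            # onto the learner's policy before the next improvement step.
            guider = np.tile(learner, (2, 1))
    return learner


naive_learner = train(backtrack=False)
guided_learner = train(backtrack=True)
print("no backtracking:", naive_learner, expected_return(naive_learner))
print("with backtracking:", guided_learner, expected_return(guided_learner))
```

In this toy setting, the learner that imitates an unconstrained guider ends up splitting its probability between left and right, for an expected return near 0, while the learner trained with the backtracking step concentrates on the safe action, for an expected return near 0.5. That gap is one simple picture of the "impossibly good" teacher problem the reviews mention.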
Peer Reviews
Decision: ICLR 2026 Poster
- The idea of the guider's backtracking step is novel and interesting. It interprets the principled solution to the "impossibly good" teacher problem. By actively keeping the privileged guider's policy within the learner's reachable policy region, GPO ensures that the supervision remains beneficial.
- The implementation of the GPO method is quite clever, with additional state and observation and incorporated into PPO policy improvement.
- The method is rigorously tested across a diverse set of
- The main concern of the paper is the mismatch between the theory part and the practical implementation. For example, in Section 3.1, the GPO iteration is performed in four steps in sequence, but the actual loss of the 4 steps is combined in one policy improvement step in PPO, which might result in an optimisation issue. Besides, the use of L4 loss is not quite obvious, since the GPO iteration should work even if the learner doesn't have a value function.
- There are some unclear design choice
- Well-written, clear writing
- Relevant problem
- Good empirical support

I can't vouch for its originality though as I'm not well familiar with the literature
- Despite Theorem 1, the proposed method is mostly heuristic, and good empirical performance might not replicate to other
+ The idea of co-training a teacher and student policy is intriguing.
+ The proposed method is technically solid.
+ Experiments seem comprehensive on SOTA locomotion tasks.
- The training process involves hyperparameters and seems intricate to tune.
- Frequent heuristic descriptions: Terms such as “possibly good region,” “impossibly good,” and “inimitable” require clearer definitions. I would suggest that authors provide a more formal characterization of these concepts.
- A few claims may not be factual:
  - Authors used TSL to refer to IL and policy distillation "as there is no fundamental distinction between them." (Line 82). However, IL usually refers t
Taxonomy
Topics: Auction Theory and Applications
