Improving Policy Optimization via $\varepsilon$-Retrain
Luca Marzari, Priya L. Donti, Changliu Liu, Enrico Marchesini

TL;DR
This paper introduces $\varepsilon$-retrain, a novel exploration strategy that enhances policy optimization by focusing on retraining in areas where behavioral preferences are violated, leading to improved performance and sample efficiency.
Contribution
The paper proposes an iterative procedure that collects retrain areas and restarts training from them, and it uses formal verification of neural networks to quantify how closely agents adhere to behavioral preferences during policy optimization.
Findings
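Significant performance improvements across locomotion, power network, and navigation tasks.
Improved sample efficiency, demonstrated in experiments over hundreds of seeds.
Provable quantification, via formal verification, of how closely agents adhere to behavioral preferences.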
Abstract
We present $\varepsilon$-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor $\varepsilon$, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
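The restart scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `RetrainRestartSampler`, the box-shaped retrain areas built with a fixed `radius` around a violating state, and the multiplicative decay schedule are all assumptions made for illustration. The abstract only states that restarts switch between a uniform distribution and the retrain areas using a decaying factor $\varepsilon$.

```python
import random


class RetrainRestartSampler:
    """Hypothetical sketch: mix uniform restarts with restarts from retrain areas."""

    def __init__(self, low, high, epsilon=0.5, decay=0.99, radius=0.1, seed=None):
        self.low = list(low)        # lower bounds of the restart state space
        self.high = list(high)      # upper bounds of the restart state space
        self.epsilon = epsilon      # probability of restarting in a retrain area
        self.decay = decay          # assumed multiplicative decay per epoch
        self.radius = radius        # assumed half-width of each retrain area
        self.retrain_areas = []     # list of (box_low, box_high) pairs
        self.rng = random.Random(seed)

    def record_violation(self, state):
        """Store a box around a state where the preference was violated."""
        box_low = [max(lo, s - self.radius) for s, lo in zip(state, self.low)]
        box_high = [min(hi, s + self.radius) for s, hi in zip(state, self.high)]
        self.retrain_areas.append((box_low, box_high))

    def sample(self):
        """Return (restart_state, came_from_retrain_area)."""
        use_retrain = bool(self.retrain_areas) and self.rng.random() < self.epsilon
        if use_retrain:
            low, high = self.rng.choice(self.retrain_areas)
        else:
            low, high = self.low, self.high
        state = [self.rng.uniform(lo, hi) for lo, hi in zip(low, high)]
        return state, use_retrain

    def end_of_epoch(self):
        """Decay epsilon so restarts drift back toward the uniform distribution."""
        self.epsilon *= self.decay
```

In a training loop under these assumptions, `record_violation` would be called whenever a rollout breaks the behavioral preference, `sample` would supply the start state of each episode, and `end_of_epoch` would be called once per policy update. Until the first violation is recorded, every restart is uniform.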
Taxonomy
Topics: Machine Learning and Algorithms · Complexity and Algorithms in Graphs · Formal Methods in Verification
