Constrained Policy Optimization via Sampling-Based Weight-Space Projection
Shengfan Cao, Francesco Borrelli, Eunhyek Joa

TL;DR
This paper introduces SCPO, a sampling-based method for safe policy optimization that enforces safety constraints directly in parameter space without requiring gradients of the constraint functions, ensuring safety and stability.
Contribution
SCPO is a novel sampling-based weight-space projection method that keeps every intermediate policy safe during learning, given a safe initialization and feasible projections, without gradient access to the constraint functions.
Findings
SCPO effectively rejects unsafe updates in experiments.
It maintains feasibility throughout training.
It achieves meaningful objective improvements while ensuring safety.
Abstract
Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO…
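The projection step described in the abstract can be illustrated with a much simpler special case. The sketch below is not the paper's QCQP: it assumes a single scalar safety metric g(theta), with g(theta) <= 0 meaning safe, and a known Lipschitz-style smoothness bound L such that |g(theta') - g(theta)| <= L * ||theta' - theta||. Under those assumptions, one rollout at the current parameters certifies a ball of radius -g(theta) / L as safe, and projecting a gradient step onto that ball has a closed form. The function names, the constant L, and the single-ball safe region are all hypothetical choices made for illustration.

```python
import numpy as np


def safe_radius(margin: float, lipschitz: float) -> float:
    """Radius of the certified-safe ball around an evaluated parameter vector.

    Assumption: margin = -g(theta) >= 0 comes from a rollout, and g changes by
    at most `lipschitz` per unit of parameter distance.
    """
    return max(margin, 0.0) / lipschitz


def project_to_ball(candidate: np.ndarray, center: np.ndarray, radius: float) -> np.ndarray:
    """Euclidean projection of `candidate` onto the ball ||theta - center|| <= radius.

    This solves the one-constraint problem
        min ||theta - candidate||^2  s.t.  ||theta - center||^2 <= radius^2
    in closed form.
    """
    offset = candidate - center
    dist = np.linalg.norm(offset)
    if dist <= radius:
        return candidate.copy()  # the unconstrained step is already certified safe
    return center + offset * (radius / dist)  # shrink the step onto the boundary


def safe_step(theta: np.ndarray, grad: np.ndarray, lr: float,
              margin: float, lipschitz: float) -> np.ndarray:
    """One gradient step, projected back into the certified-safe ball around theta."""
    candidate = theta - lr * grad
    return project_to_ball(candidate, theta, safe_radius(margin, lipschitz))


# Hypothetical usage: a step of length 5 is shrunk to the certified radius of 2.
theta = np.zeros(2)
grad = np.array([-3.0, -4.0])
theta_next = safe_step(theta, grad, lr=1.0, margin=1.0, lipschitz=0.5)
```

Because each accepted step stays inside a region certified from the current safe parameters, repeating `safe_step` mirrors the safe-by-induction argument in the abstract, as long as the assumed smoothness bound actually holds. With several sampled parameter vectors or several safety metrics, the region becomes an intersection of quadratic constraints and a general convex solver would be needed instead of the closed form.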
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
