Massively Scaling Explicit Policy-conditioned Value Functions
Nico Bohlinger, Jan Peters

TL;DR
This paper presents a scalable approach for explicit policy-conditioned value functions (EPVFs) that enhances performance on complex continuous-control tasks through massive parallelization, novel neural architectures, and effective exploration strategies.
Contribution
It introduces a scaling strategy for EPVFs that enables their application to challenging continuous-control tasks and yields results competitive with established deep reinforcement learning (DRL) algorithms.
Findings
EPVFs can be scaled to solve complex tasks like a custom Ant environment.
EPVFs achieve performance comparable to PPO and SAC.
Weight clipping and scaled perturbations address unrestricted parameter growth and exploration in the policy parameter space.
Abstract
We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(θ) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale struggle with unrestricted parameter growth and efficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, big batch sizes, weight clipping and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy…
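The loop described in the abstract (perturb the policy parameters, fit a value function V(θ) on the observed returns, ascend the gradient of V with respect to θ, then clip the weights) can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses neural networks and GPU-based simulators, whereas this sketch assumes a quadratic least-squares value model and a hypothetical `toy_return` function standing in for simulator rollouts. All names and hyperparameters (`noise_scale`, `lr`, `clip`, `batch`) are illustrative assumptions.

```python
import numpy as np


def toy_return(thetas):
    # Hypothetical stand-in for episode returns from a simulator:
    # the return is highest when theta equals an arbitrary target.
    target = np.array([0.5, -0.3])
    return -np.sum((thetas - target) ** 2, axis=-1)


def features(thetas):
    # Quadratic features [theta, theta^2, 1] so that a linear model
    # can represent V(theta) for this toy problem.
    thetas = np.atleast_2d(thetas)
    ones = np.ones((thetas.shape[0], 1))
    return np.concatenate([thetas, thetas ** 2, ones], axis=1)


def fit_value_function(thetas, returns):
    # Least-squares regression from policy parameters to returns.
    w, *_ = np.linalg.lstsq(features(thetas), returns, rcond=None)
    return w


def value_gradient(w, theta):
    # Analytic gradient of V(theta) = w_lin . theta + w_quad . theta^2 + b.
    d = theta.shape[0]
    return w[:d] + 2.0 * w[d:2 * d] * theta


def train(steps=50, batch=256, noise_scale=0.1, lr=0.1, clip=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(steps):
        # Exploration: a large batch of scaled perturbations around theta.
        noise = rng.standard_normal((batch, theta.shape[0]))
        perturbed = theta + noise_scale * noise
        returns = toy_return(perturbed)
        # Fit V(theta) on the batch, then ascend its gradient directly.
        w = fit_value_function(perturbed, returns)
        theta = theta + lr * value_gradient(w, theta)
        # Weight clipping keeps the parameters in a bounded range.
        theta = np.clip(theta, -clip, clip)
    return theta


if __name__ == "__main__":
    print(train())
```

In this toy setting the fitted model can represent the return exactly, so the parameters move toward the target; with a neural value function and real rollouts, the fit is only approximate and the choice of perturbation scale and clipping range matters far more.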
Taxonomy
Topics: Logic, Reasoning, and Knowledge
