Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution
Senne Deproost, Denis Steckelmacher, Ann Nowé

TL;DR
This paper introduces a method to directly synthesize human-readable programs as reinforcement learning policies during training, improving transparency and efficiency over traditional post-hoc distillation methods.
Contribution
It proposes a novel approach combining TD3 critics with genetic algorithms to learn interpretable programs in real-time during training.
Findings
Demonstrates high sample-efficiency in a gridworld environment
Shows improved explainability of learned policies
Validates the approach's effectiveness and transparency
Abstract
With Deep Reinforcement Learning (DRL) being increasingly considered for the control of real-world systems, the lack of transparency of the neural network at the core of RL becomes a concern. Programmatic Reinforcement Learning (PRL) is able to create representations of this black-box in the form of source code, not only increasing the explainability of the controller but also allowing for user adaptations. However, these methods focus on distilling a black-box policy into a program and do so after learning using the Mean Squared Error between produced and wanted behaviour, discarding other elements of the RL algorithm. The distilled policy may therefore perform significantly worse than the black-box learned policy. In this paper, we propose to directly learn a program as the policy of an RL agent. We build on TD3 and use its critics as the basis of the objective function of a…
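The idea of scoring candidate programs with critics rather than with a Mean Squared Error against a teacher policy can be illustrated with a toy sketch. This is not the authors' implementation: the two critics `q1` and `q2` are fixed stand-in functions instead of trained TD3 networks, the "program" is a hypothetical two-parameter expression (`action = gain * state + bias`), taking the minimum of both critics as fitness is an assumption, and the evolutionary loop is a plain elitist mutation scheme chosen for brevity.

```python
import random

# Hypothetical stand-ins for TD3's two learned critics Q1(s, a) and Q2(s, a).
# Real critics would be trained neural networks; these are fixed toy functions.
def q1(state, action):
    return -(action + 0.5 * state) ** 2

def q2(state, action):
    return -(action + 0.5 * state) ** 2 - 0.01

def run_program(program, state):
    """Interpret a tiny 'program': action = gain * state + bias, clipped to [-1, 1]."""
    gain, bias = program
    return max(-1.0, min(1.0, gain * state + bias))

def fitness(program, states):
    """Mean critic value of the program's actions (assumption: min over both critics)."""
    total = 0.0
    for s in states:
        a = run_program(program, s)
        total += min(q1(s, a), q2(s, a))
    return total / len(states)

def mutate(program, rng, scale=0.1):
    """Perturb every parameter of the program with Gaussian noise."""
    return tuple(p + rng.gauss(0.0, scale) for p in program)

def evolve(states, generations=50, pop_size=20, elite=5, seed=0):
    """Elitist evolution: keep the best programs, refill with mutated copies."""
    rng = random.Random(seed)
    population = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, states), reverse=True)
        parents = population[:elite]
        children = [mutate(rng.choice(parents), rng) for _ in range(pop_size - elite)]
        population = parents + children
    return max(population, key=lambda p: fitness(p, states))

# States sampled from a hypothetical replay buffer, here a grid over [-1, 1].
states = [i / 10 - 1.0 for i in range(21)]
best = evolve(states)
print("best program: action = %.3f * state + %.3f" % best)
```

Because the fitness queries the critics directly, the search optimizes estimated return rather than imitation error, which is the contrast the abstract draws with post-hoc distillation. In an actual agent the critics would keep changing during training, so fitness values would have to be recomputed as they are updated.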
Taxonomy
Topics: Evolutionary Algorithms and Applications
Methods: Dense Connections · Target Policy Smoothing · Adam · Clipped Double Q-learning · Experience Replay · Twin Delayed Deep Deterministic · Focus
