Enhancing RL Safety with Counterfactual LLM Reasoning
Dennis Gross, Helge Spieker

TL;DR
This paper introduces a method that uses counterfactual reasoning with large language models to improve the safety and explainability of reinforcement learning policies after training.
Contribution
The paper applies counterfactual LLM reasoning to the post-training safety enhancement of RL policies, a new use of LLMs in RL safety.
Findings
Improves RL policy safety post-training.
Enhances explainability of RL policies.
Demonstrates that counterfactual LLM reasoning is effective for both improving and explaining policy safety.
Abstract
Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves and helps to explain the RL policy safety.
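The abstract does not describe the mechanism in detail, so the following is only a minimal illustrative sketch of what a post-training counterfactual check could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the grid states, the `HAZARDS` set, the `trained_policy` stand-in, the `stub_llm` function (a placeholder that returns a fixed reply instead of calling a real model), and the `action | explanation` reply format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Tuple[int, int]
Action = str

# Hypothetical toy environment: moves on a grid and one unsafe cell.
MOVES: Dict[Action, Tuple[int, int]] = {
    "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
}
HAZARDS = {(1, 0)}


def step(state: State, action: Action) -> State:
    """Deterministic transition of the toy grid."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def is_unsafe(state: State) -> bool:
    return state in HAZARDS


def trained_policy(state: State) -> Action:
    """Stand-in for an already trained RL policy (always moves right)."""
    return "right"


def stub_llm(prompt: str) -> str:
    """Placeholder for an LLM call; a real system would query a model here."""
    return "down | Moving right enters a hazard cell; moving down avoids it."


@dataclass
class Decision:
    action: Action
    overridden: bool
    explanation: str


def counterfactual_repair(
    state: State,
    policy: Callable[[State], Action],
    llm: Callable[[str], str],
) -> Decision:
    """Ask the LLM a counterfactual question only when the policy action is unsafe."""
    proposed = policy(state)
    if not is_unsafe(step(state, proposed)):
        return Decision(proposed, False, "Policy action is safe.")

    prompt = (
        f"State: {state}. The policy chose '{proposed}', which leads to an "
        f"unsafe state. Which of {sorted(MOVES)} would have been safe instead, "
        "and why? Answer as 'action | explanation'."
    )
    suggestion, _, explanation = llm(prompt).partition("|")
    suggestion = suggestion.strip()

    # Verify the suggestion against the environment instead of trusting the LLM.
    if suggestion in MOVES and not is_unsafe(step(state, suggestion)):
        return Decision(suggestion, True, explanation.strip())
    # No verified alternative: the original (unsafe) action is kept and flagged.
    return Decision(proposed, False, "No verified safe alternative found.")


if __name__ == "__main__":
    print(counterfactual_repair((0, 0), trained_policy, stub_llm))
```

In this sketch the LLM's explanation doubles as a human-readable account of why the original action was unsafe, which mirrors the explainability goal stated in the abstract. The suggested action is checked against the environment before it replaces the policy's choice, since an LLM reply alone is not a safety guarantee.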
Taxonomy
Topics: Software Reliability and Analysis Research · Security and Verification in Computing · Web Application Security Vulnerabilities
