Enhancing RL Safety with Counterfactual LLM Reasoning

Dennis Gross; Helge Spieker

arXiv:2409.10188·cs.LG·September 17, 2024

TL;DR

This paper introduces a method that uses counterfactual reasoning with large language models to improve the safety and explainability of reinforcement learning policies after training.

Contribution

The paper applies counterfactual LLM reasoning to already-trained RL policies in order to improve their safety, a new use of LLMs in RL safety.

Findings

01 Improves RL policy safety post-training.

02 Enhances explainability of RL policies.

03 Demonstrates effectiveness of counterfactual LLM reasoning.

Abstract

Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves and helps to explain the RL policy safety.

Code & Models

Repositories

lava-lab/cool-mc (PyTorch, official)

Taxonomy

Topics: Software Reliability and Analysis Research · Security and Verification in Computing · Web Application Security Vulnerabilities