Can a Bayesian Oracle Prevent Harm from an Agent?

Yoshua Bengio; Michael K. Cohen; Nikolay Malkin; Matt MacDermott; Damiano Fornasiere; Pietro Greiner; Younesse Kaddar

arXiv:2408.05284·cs.AI·June 17, 2025


TL;DR

This paper explores a Bayesian approach to estimating context-dependent safety violation probabilities in AI systems, aiming to provide probabilistic safety guarantees and prevent harmful actions through run-time risk bounds.

Contribution

The paper introduces a method for deriving bounds on the probability of a safety violation by reasoning over Bayesian hypotheses about the world. The bounds apply in both i.i.d. and non-i.i.d. settings.

Findings

01. Derived bounds on safety violation probabilities under unknown hypotheses
02. Proposed Bayesian hypothesis search for cautious risk estimation
03. Addressed both i.i.d. and non-i.i.d. scenarios

Abstract

Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian…
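The idea of rejecting actions based on cautious but plausible hypotheses can be illustrated with a minimal sketch. It is not the paper's actual bound: the plausibility rule used here (keep every hypothesis whose posterior is within a fixed ratio of the most probable one), the finite hypothesis set, and all names (`Hypothesis`, `cautious_harm_estimate`, `guardrail_allows`, `plausibility_ratio`, `risk_threshold`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Hypothesis:
    """A hypothetical world model (illustrative, not from the paper)."""
    name: str
    prior: float       # prior probability of this hypothesis
    likelihood: float  # probability it assigns to the data observed so far
    harm_prob: float   # harm probability it predicts for the proposed action


def posterior(hyps: Sequence[Hypothesis]) -> List[float]:
    """Bayes' rule over a finite hypothesis set: posterior ∝ prior × likelihood."""
    joint = [h.prior * h.likelihood for h in hyps]
    total = sum(joint)
    if total <= 0:
        raise ValueError("no hypothesis assigns positive probability to the data")
    return [j / total for j in joint]


def cautious_harm_estimate(hyps: Sequence[Hypothesis],
                           plausibility_ratio: float = 0.1) -> float:
    """Largest predicted harm probability among plausible hypotheses.

    ASSUMPTION: a hypothesis counts as plausible when its posterior is at
    least `plausibility_ratio` times the largest posterior. This cutoff is a
    stand-in for the paper's maximization, not a reproduction of it.
    """
    post = posterior(hyps)
    cutoff = plausibility_ratio * max(post)
    return max(h.harm_prob for h, p in zip(hyps, post) if p >= cutoff)


def guardrail_allows(hyps: Sequence[Hypothesis],
                     risk_threshold: float = 0.05,
                     plausibility_ratio: float = 0.1) -> bool:
    """Run-time check: allow the action only if the cautious estimate is low."""
    return cautious_harm_estimate(hyps, plausibility_ratio) <= risk_threshold


# Made-up numbers: the most probable hypothesis predicts little harm, but a
# still-plausible alternative predicts substantial harm.
hyps = [
    Hypothesis("benign", prior=0.5, likelihood=0.02, harm_prob=0.01),
    Hypothesis("risky", prior=0.3, likelihood=0.01, harm_prob=0.40),
    Hypothesis("implausible", prior=0.2, likelihood=0.00001, harm_prob=0.99),
]
post = posterior(hyps)
average_harm = sum(p * h.harm_prob for p, h in zip(post, hyps))
cautious_harm = cautious_harm_estimate(hyps)
allowed = guardrail_allows(hyps)
```

In this toy example the posterior-averaged harm probability is about 0.10, while the cautious estimate is 0.40 because the "risky" hypothesis remains plausible, so the guardrail rejects the action. The "implausible" hypothesis is excluded by the cutoff and does not drive the estimate, which is the sense in which the estimate is cautious but still tied to the evidence.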

Peer Reviews

Decision·UAI 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

This is a very well written paper, and it is easy to follow. The proposed approach represents a promising initial step toward designing AI systems that ensure safety through built-in probabilistic guarantees, rather than relying solely on external safety mechanisms. The authors also outline several open problems for future work.

Weaknesses

The authors present an upper bound on the harm probability, though it appears to be highly conservative. It would be valuable if they could offer a convergence rate or practical guarantees to make the framework more usable. Additionally, it is unclear how this approach compares to other conservative methods for preventing harm. Since the theoretical results lack practical assurances, I would have appreciated more experimental validation, especially in complex and realistic settings. Obtainin

Reviewer 02 · Rating 5 · Confidence 3

Strengths

S1. The topic of AI safety is timely and relevant for ICLR.
S2. The theoretical results (as far as I could check) are sound.
S3. The experimental evaluation serves to showcase how these bounds could be used in a realistic scenario.

Weaknesses

W1. I understand the appeal to frame this work in the context of harm by an AI agent, and I think it is an interesting point. However, there is nothing inherent to "harm" in the concept presented. The concept of "harm" could be substituted by "reward at a state" and we could be discussing the same results in a different light. I think the paper may benefit from a more general motivation.
W2. While the experimental evaluation is welcome, it is a very simple example, and one wonders if these theo

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The paper is well-organized and clearly written. All the theoretical assumptions have been stated.
- The proposed concentration results seem reasonable. The derivations seem technically sound.
- Training large AI systems to satisfy certain safety criteria (i.e., with guardrails) is an exciting problem. This paper formulates this problem as a hypothesis-testing problem and presents non-trivial algorithms to perform the test. This problem formulation could be inspiring for other AI researchers a

Weaknesses

- The concentration result in Prop. 3.1 assumes "all theories in $M$ are distinct as probability measures." This assumption does not seem to hold for many common probabilistic models. For instance, in the linear component analysis, the number of independent components is generally not uniquely discernible (i.e., not identifiable) with non-linear mixing functions. Also, the number of latent components in Gaussian mixtures is generally not identifiable from the observed data. This seems to suggest tha

Code & Models

Repositories

saifh-github/conservative-bayesian-public
pytorch · Official

Videos

MIGHT THE ROBOTS TAKE OVER? [Prof. Yoshua Bengio] · YouTube

Taxonomy

Topics · Blockchain Technology Applications and Security