Consensus Sampling for Safer Generative AI

Adam Tauman Kalai; Yael Tauman Kalai; Or Zamir

arXiv:2511.09493 · cs.AI · May 12, 2026

TL;DR

This paper introduces consensus sampling, a robust aggregation method for generative AI safety that combines multiple distributions to mitigate risks and abstains when agreement is insufficient.

Contribution

It proposes a novel, model-agnostic algorithm for safe distribution aggregation with formal guarantees and demonstrates its effectiveness through experiments.

Findings

01 Consensus sampling achieves risk competitive with the average risk of the safest $s$ of the $k$ input distributions, and abstains when agreement is insufficient.

02 The guarantee is formalized as R-robustness, which also bounds information leakage and adversarial influence.

03 Experiments show the approach works on synthetic and image generation tasks.

Abstract

Motivated by undetectable risks in generative AI, we study a general robust aggregation problem: how to aggregate several probability distributions to boost safety. We present consensus sampling, a black-box algorithm that, given $k$ distributions, has risk competitive with the average risk of the safest $s$ of them while abstaining when there is insufficient agreement. This yields an architecture-agnostic approach to generative-model safety when the distributions are induced by models that can sample and evaluate output probabilities. We formalize the guarantee through R-robustness, which also bounds information leakage and adversarial influence. Inspired by robust statistics and the provable copyright protection algorithm of Vyas et al. (2023), we show that while a standard mixture is vulnerable to one unsafe constituent, a pointwise-median construction provides robust intuition, and our…
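
The contrast the abstract draws between a standard mixture and a pointwise median can be illustrated with a toy example over a small discrete outcome set. This is only a sketch of that intuition, not the paper's consensus sampling algorithm: the function names `mixture` and `pointwise_median`, the toy distributions, and the choice to read the leftover probability mass as abstention are all assumptions made for illustration.

```python
from statistics import median

def mixture(dists):
    """Uniform mixture: average the probability each distribution assigns to an outcome."""
    support = set().union(*dists)  # all outcomes that appear in any distribution
    k = len(dists)
    return {y: sum(d.get(y, 0.0) for d in dists) / k for y in support}

def pointwise_median(dists):
    """Median of the per-outcome probabilities; the result may sum to less than 1."""
    support = set().union(*dists)
    return {y: median(d.get(y, 0.0) for d in dists) for y in support}

# Hypothetical toy inputs: two distributions that only produce benign outcomes,
# and one unsafe constituent that always produces a harmful outcome.
safe_a = {"ok_1": 0.6, "ok_2": 0.4}
safe_b = {"ok_1": 0.5, "ok_2": 0.5}
unsafe = {"harmful": 1.0}
dists = [safe_a, safe_b, unsafe]

mix = mixture(dists)
med = pointwise_median(dists)

# Assumption for illustration: mass missing from the median is treated as abstention.
abstain_mass = 1.0 - sum(med.values())

print("mixture mass on 'harmful':", mix["harmful"])  # one unsafe constituent contributes 1/k
print("median mass on 'harmful':", med["harmful"])   # the two safe distributions outvote it
print("median abstention mass:", abstain_mass)
```

In this toy setting the single unsafe constituent gives the harmful outcome a third of the mixture's mass, while the pointwise median assigns it zero because a majority of the inputs give it zero probability. The price is that the median is sub-normalized, which is one way to see why an aggregation rule of this kind needs an abstention option.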
