Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

Charafeddine Mouzouni

arXiv:2602.21368 · cs.LG · February 26, 2026

TL;DR

This paper combines self-consistency sampling and conformal calibration to assign each black-box AI system and task a single reliability level, which serves as a deployment gate with finite-sample, distribution-free guarantees.

Contribution

It presents a novel black-box reliability certification technique that offers exact, distribution-free confidence guarantees applicable across various models and datasets.

Findings

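01. Reliability levels track model strength: weaker models earn lower levels. The reliability level is a distinct quantity from accuracy.

02. Validation across five benchmarks and five models from three families shows high conditional coverage.

03. Sequential stopping reduces API costs by approximately 50%.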
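
The summary does not state the paper's stopping rule, so the following is only a minimal sketch of one common way to stop self-consistency sampling early: draw answers one at a time and stop as soon as the leading answer can no longer be overtaken within the remaining budget. The function name `sample_until_decided` and the `draw` callable (a stand-in for one API call to the black-box system) are hypothetical, and this rule is an assumption, not the author's method.

```python
from collections import Counter


def sample_until_decided(draw, max_samples):
    """Draw answers one at a time and stop once the leader cannot be overtaken.

    `draw` is a hypothetical zero-argument callable standing in for one
    API call to the black-box system. Returns (leading_answer, calls_used).
    """
    counts = Counter()
    for used in range(1, max_samples + 1):
        counts[draw()] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        remaining = max_samples - used
        # If the lead exceeds the calls left, no answer can catch up,
        # so further sampling cannot change the plurality answer.
        if lead > remaining:
            return ranked[0][0], used
    # Budget exhausted without a strict early decision (e.g. a tie).
    return counts.most_common(1)[0][0], max_samples
```

With a budget of 10 calls and a system that always returns the same answer, this rule stops after 6 calls. Actual savings depend on how often the system agrees with itself, and a rule that also needs stable answer frequencies for calibration may have to sample longer.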

Abstract

Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors -- made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real…
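
The pipeline described in the abstract can be illustrated with a minimal sketch of standard split conformal calibration applied to self-consistency answer frequencies. This is an assumption about the general technique, not the paper's exact construction: the nonconformity score (one minus the sampled frequency of the correct answer), the helper names, and the toy data are all hypothetical. With n calibration items and target miscoverage alpha, the standard result is that the answer set covers the correct answer with probability at least 1 - alpha and, when scores are almost surely distinct, at most 1 - alpha + 1/(n+1).

```python
import math
from collections import Counter


def answer_frequencies(samples):
    """Empirical frequency of each distinct answer among the sampled outputs."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


def nonconformity(samples, true_answer):
    """Score = 1 - frequency of the correct answer (assumed score function)."""
    return 1.0 - answer_frequencies(samples).get(true_answer, 0.0)


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration items for this alpha: every answer must be kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(samples, threshold):
    """All sampled answers whose score does not exceed the threshold."""
    freqs = answer_frequencies(samples)
    return {answer for answer, p in freqs.items() if 1.0 - p <= threshold}


# Toy calibration data (hypothetical): sampled answers plus the known correct one.
calibration = [
    (["4", "4", "4", "5"], "4"),
    (["7", "7", "8", "8"], "7"),
    (["1", "2", "3", "1"], "1"),
    (["9", "9", "9", "9"], "9"),
]
scores = [nonconformity(samples, truth) for samples, truth in calibration]
threshold = conformal_threshold(scores, alpha=0.25)

easy_set = prediction_set(["a", "a", "a", "b"], threshold)
hard_set = prediction_set(["a", "b", "a", "b"], threshold)
```

In this toy run the question with a clear majority answer yields a single-answer set, while the evenly split question yields a two-answer set, which matches the abstract's remark that harder questions surface as larger answer sets.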

Taxonomy

Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Adversarial Robustness in Machine Learning