Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

Edoardo Pona; Milad Kazemi; Mehran Hosseini; Yali Du; David Watson; Osvaldo Simeone; Nicola Paoletti

arXiv:2604.14251 · cs.LG · April 17, 2026

TL;DR

Calibrate-Then-Delegate (CTD) is a model cascade for safety monitoring that provides probabilistic guarantees on computation cost while making instance-level (streaming) escalation decisions using a delegation value (DV) probe.

Contribution

The paper introduces CTD, a model cascade that pairs a delegation value (DV) probe with a budget-calibrated threshold, and reports more efficient safety monitoring than uncertainty-based delegation.

Findings

01

CTD outperforms uncertainty-based delegation across four safety datasets.

02

CTD avoids harmful over-delegation and adapts to input difficulty.

03

CTD provides finite-sample guarantees on the delegation rate.

Abstract

Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding…
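The budget-calibration idea in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it names the testing procedure, so everything below is an assumption rather than the authors' method: it uses an exact binomial tail p-value per candidate threshold and fixed-sequence testing from the most conservative threshold downward. The names `binom_cdf`, `calibrate_threshold`, `should_delegate`, `budget`, and `delta` are hypothetical, and the DV scores are synthetic uniform draws standing in for held-out probe outputs.

```python
import math
import random


def binom_cdf(k, n, p):
    """Return P[Binomial(n, p) <= k] for 0 < p < 1, summing terms in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def calibrate_threshold(dv_scores, budget, delta, candidates):
    """Return the lowest candidate threshold certified to keep delegation under budget.

    Assumed procedure (not taken from the paper): for each threshold t, test the
    null "true delegation rate at t >= budget" with a binomial tail p-value, and
    walk the candidates from highest to lowest, stopping at the first null that
    cannot be rejected at level delta (fixed-sequence testing).
    Returns None if even the highest threshold is not certified.
    """
    n = len(dv_scores)
    chosen = None
    for t in sorted(candidates, reverse=True):
        k = sum(1 for s in dv_scores if s > t)  # held-out inputs that would be delegated
        p_value = binom_cdf(k, n, budget)
        if p_value > delta:
            break
        chosen = t
    return chosen


def should_delegate(dv_score, threshold):
    """Instance-level decision: escalate when the DV score exceeds the threshold."""
    return dv_score > threshold


# Synthetic held-out DV scores; real scores would come from the DV probe.
rng = random.Random(0)
calibration_scores = [rng.random() for _ in range(500)]
grid = [i / 100 for i in range(101)]

threshold = calibrate_threshold(calibration_scores, budget=0.2, delta=0.1, candidates=grid)
rate = sum(should_delegate(s, threshold) for s in calibration_scores) / len(calibration_scores)
print(threshold, rate)
```

Under the assumption that calibration and test inputs are i.i.d., this procedure keeps the true delegation rate below `budget` with probability at least `1 - delta` over the calibration draw. Lowering the threshold delegates more inputs, so the search stops at the lowest threshold that is still certified. Once the threshold is fixed, `should_delegate` needs only the current input's DV score, which is what makes per-instance streaming decisions possible.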
