Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models
William Overman, Mohsen Bayati

TL;DR
Conformal Arbitrage is a post hoc framework that balances competing objectives in language models by calibrating a data-driven threshold with conformal risk control, keeping the rate of undesirable events within a user-specified quota without retraining models.
Contribution
It introduces a novel, distribution-free method for mediating between primary and conservative models at the API level, enhancing safety and efficiency in language model deployment.
Findings
Outperforms random routing in accuracy and cost.
Provides finite-sample, distribution-free safety guarantees.
Enables efficient trade-offs between objectives.
Abstract
Modern language model deployments must often balance competing objectives, for example, helpfulness versus harmlessness, cost versus accuracy, and reward versus safety. We introduce Conformal Arbitrage, a post hoc framework that learns a data-driven threshold to mediate between a Primary model optimized for a primary objective and a more conservative Guardian, which could be another model or a human domain expert, aligned with a guardrail objective. The threshold is calibrated with conformal risk control, yielding finite-sample, distribution-free guarantees that the long-run frequency of undesirable events, such as factual errors or safety violations, does not exceed a user-specified quota. Because Conformal Arbitrage operates wholly at the API level, without requiring access to model logits or updating model weights, it complements weight-based alignment techniques and integrates…
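To make the calibration step concrete, here is a minimal sketch of threshold selection with conformal risk control. It is an illustration under simplifying assumptions, not the authors' procedure: it assumes the Primary model exposes a scalar confidence score per query, that deferring to the Guardian never produces the undesirable event, and that the loss is a 0/1 indicator. The function names `calibrate_threshold` and `route`, the grid, and the score definition are all hypothetical. The selection rule itself is the standard conformal risk control bound for a loss bounded by 1 and non-increasing in the threshold: pick the smallest threshold whose inflated empirical risk, (n / (n + 1)) * risk + 1 / (n + 1), is at most the quota alpha.

```python
import numpy as np


def calibrate_threshold(confidence, primary_error, alpha, grid=None):
    """Pick the smallest threshold whose conformal risk bound is <= alpha.

    confidence    : hypothetical Primary confidence score per calibration query
    primary_error : True where the Primary's answer is an undesirable event
    alpha         : user-specified quota on the undesirable-event rate
    grid          : candidate thresholds (assumption: 101 points on [0, 1])
    """
    confidence = np.asarray(confidence, dtype=float)
    primary_error = np.asarray(primary_error, dtype=bool)
    n = len(confidence)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)

    for lam in np.sort(np.asarray(grid, dtype=float)):
        # Loss is 1 only when the Primary answers (confidence >= lam) and errs.
        # Raising lam defers more queries, so this loss is non-increasing in lam.
        empirical_risk = np.mean((confidence >= lam) & primary_error)
        # Conformal risk control bound for a loss bounded by 1.
        if (n / (n + 1)) * empirical_risk + 1.0 / (n + 1) <= alpha:
            return float(lam)

    # No candidate meets the quota (for example, too few calibration points):
    # fall back to always deferring to the Guardian.
    return float("inf")


def route(confidence, lam):
    """Send a query to the Primary if it clears the threshold, else the Guardian."""
    return "primary" if confidence >= lam else "guardian"
```

A lower threshold lets the Primary answer more queries, which favors the primary objective, while the bound keeps the expected rate of undesirable events within the quota on exchangeable data. When the calibration set is too small for any threshold to satisfy the bound, this sketch defers every query to the Guardian.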
Taxonomy
Topics: Multi-Agent Systems and Negotiation
