Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees
Yu Gui, Ying Jin, Zhimei Ren

TL;DR
Conformal Alignment identifies foundation model outputs that can be trusted in high-stakes tasks, guaranteeing that, on average, a prescribed fraction of the selected outputs meet a user-specified alignment criterion.
Contribution
The paper introduces a general conformal prediction framework for determining when foundation model outputs meet a specified alignment criterion, with theoretical guarantees.
Findings
Accurately identifies trustworthy units in question answering and radiology report generation.
Requires only lightweight training on a moderate amount of reference data.
Combines various features to improve alignment prediction.
Abstract
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent…
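The selection step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract is cut off before it specifies the data-dependent threshold, so the sketch uses conformal p-values combined with the Benjamini-Hochberg procedure, a standard way to obtain such a threshold with control of the expected fraction of false selections. The alignment predictor is assumed to be already trained, and its outputs are passed in as plain scores. All function and parameter names (conformal_p_values, benjamini_hochberg, select_trusted, alpha) are hypothetical.

```python
from typing import List, Sequence


def conformal_p_values(
    calib_scores: Sequence[float],
    calib_aligned: Sequence[bool],
    test_scores: Sequence[float],
) -> List[float]:
    """Conformal p-value for each test unit.

    Assumption: scores come from an already-trained alignment predictor,
    and higher scores mean "more likely aligned". The p-value is small when
    the test score exceeds the scores of calibration units known to be
    NOT aligned.
    """
    n = len(calib_scores)
    unaligned = [s for s, a in zip(calib_scores, calib_aligned) if not a]
    return [
        (1 + sum(1 for u in unaligned if u >= t)) / (n + 1)
        for t in test_scores
    ]


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Indices selected by the Benjamini-Hochberg procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k_star = 0
    # Find the largest rank k with p_(k) <= alpha * k / m.
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / m:
            k_star = rank
    return sorted(order[:k_star])


def select_trusted(
    calib_scores: Sequence[float],
    calib_aligned: Sequence[bool],
    test_scores: Sequence[float],
    alpha: float = 0.1,
) -> List[int]:
    """Return indices of test units whose outputs are selected as trustworthy.

    alpha is the target upper bound on the expected fraction of selected
    units that are in fact not aligned (an assumption of this sketch).
    """
    p = conformal_p_values(calib_scores, calib_aligned, test_scores)
    return benjamini_hochberg(p, alpha)
```

In this sketch, calling select_trusted with reference scores, their ground-truth alignment labels, and the scores of new units returns the indices of the new units to deploy. A smaller alpha yields a stricter threshold and fewer selections.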
Taxonomy
Topics: Logic, Reasoning, and Knowledge
Methods: Sparse Evolutionary Training · ALIGN
