Aligning Model Properties via Conformal Risk Control
William Overman, Jacqueline Jil Vallon, Mohsen Bayati

TL;DR
This paper introduces a property testing approach using conformal risk control to post-process pre-trained models for better alignment with desired behaviors, providing probabilistic guarantees and demonstrating applications on various datasets.
Contribution
It proposes a novel property testing framework with conformal risk control for model alignment, applicable to a wide range of properties and addressing biases in training data.
Findings
Conformal risk control provides probabilistic guarantees for model alignment.
The methodology applies to properties like monotonicity and concavity.
Pre-trained models continue to require alignment techniques even as model size or training data grows, provided the training data contains even small biases.
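To make the first two findings concrete, here is a minimal sketch of the generic conformal risk control calibration step, applied to a simple monotonicity check. It is an illustration under stated assumptions, not the authors' exact procedure: the function names `monotonicity_violation` and `crc_threshold`, the pairwise loss, and the symmetric band of half-width `lam` around the model's predictions are all hypothetical choices made for this example. The only ingredient taken from the standard conformal risk control recipe is the selection rule, which picks the smallest `lam` whose adjusted empirical risk `(n / (n + 1)) * risk + B / (n + 1)` is at most `alpha`, for a loss bounded by `B` and non-increasing in `lam`.

```python
import numpy as np


def monotonicity_violation(pred_at_x, pred_at_x_larger, lambdas):
    """Hypothetical loss: 1 if no non-decreasing function fits inside the band.

    pred_at_x[i] and pred_at_x_larger[i] are model predictions at a pair of
    inputs x_i < x_i'. With a band of half-width lam around each prediction,
    monotonicity is impossible exactly when f(x) - lam > f(x') + lam.
    Returns an (n_pairs, n_lambdas) 0/1 matrix, non-increasing along lambdas.
    """
    gap = np.asarray(pred_at_x) - np.asarray(pred_at_x_larger)
    return (gap[:, None] > 2.0 * np.asarray(lambdas)[None, :]).astype(float)


def crc_threshold(loss_matrix, lambdas, alpha, B=1.0):
    """Smallest lambda whose adjusted empirical risk is at most alpha.

    loss_matrix[i, j] is the loss of calibration example i at lambdas[j];
    lambdas must be sorted in increasing order and the loss must be bounded
    by B and non-increasing in lambda. Falls back to the largest lambda when
    no candidate satisfies the bound.
    """
    n = loss_matrix.shape[0]
    risk = loss_matrix.mean(axis=0)
    ok = (n / (n + 1)) * risk + B / (n + 1) <= alpha
    if not ok.any():
        return lambdas[-1]
    return lambdas[np.argmax(ok)]  # argmax returns the first True entry
```

In use, one would compute predictions on held-out calibration pairs, build the loss matrix over a grid of candidate `lam` values, and call `crc_threshold` with the target risk level `alpha`. Larger calibration sets shrink the `B / (n + 1)` correction, so smaller bands become admissible.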
Abstract
AI model alignment is crucial due to inadvertent biases in training data and the underspecified machine learning pipeline, where models with excellent test metrics may not meet end-user requirements. While post-training alignment via human feedback shows promise, these methods are often limited to generative AI settings where humans can interpret and provide feedback on model outputs. In traditional non-generative settings with numerical or categorical outputs, detecting misalignment through single-sample outputs remains challenging, and enforcing alignment during training requires repeating costly training processes. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model as one belonging to a subset of functions that exhibit specific desired behaviors. We focus on post-processing a…
Taxonomy
Topics: Simulation Techniques and Applications
Methods: Sparse Evolutionary Training · ALIGN · Focus
